PREFACE

This  book  is  intended  for  students  in  education,  who  usually 
have  had  little  training  in  mathematics.  For  those  who  have  had 
considerable  mathematics  the  theory  of  statistics  is  compara- 
tively easy,  but  for  students  without  such  training  the  more 
advanced  statistical  methods  offer  many  difficulties.  The  pres- 
ent volume  supplements  the  mathematical  preparation  of  the 
student  by  including  sections  on  such  topics  as  graphing,  loga- 
rithms, and  elementary  theory  of  probability.  The  proofs  of 
difficult  theorems  have  been  omitted  throughout  and  demon- 
strations have  been  included  only  when  experience  has  shown 
that  they  come  within  the  grasp  of  the  ordinary  student  and 
assist  in  a  clear  understanding  of  the  method  involved. 

Although  no  attempt  has  been  made  to  include  all  statistical 
methods  now  used  in  the  field  of  education,  the  present  text 
treats  a  somewhat  larger  number  than  will  be  found  in  most 
elementary  books.  The  chief  additions  to  the  usual  topics  are 
the  percentile  method,  application  of  the  normal  curve  in  cor- 
relating qualitative  series,  partial  and  multiple  correlation,  and 
elementary  theory  of  curve  fitting.  The  important  subject  of 
index  numbers  has  been  omitted  entirely  because  a  satisfactory 
treatment  is  beyond  the  scope  of  this  book.  The  increasing 
need  for  index  numbers  in  the  field  of  school  costs  will  probably 
lead  to  a  separate  volume  on  these  methods. 

In  order  to  insure  a  clear  understanding  of  the  statistical 
arithmetic  involved  in  the  various  methods  presented,  com- 
plete model  problems  have  been  worked  out  in  the  text.  The 
experience  of  the  writer  has  been  that  the  ordinary  student 
has  considerable  difficulty  in  formulating  his  plans  for  calcu- 
lation and  is  greatly  assisted  by  detailed  arithmetical  schemes 
for computation, particularly in the early part of the course. A
considerable  number  of  exercises  with  answers  have  been  added 
at  the  end  of  each  chapter  to  clarify  the  methods  discussed  and 
to  afford  the  student  sufficient  arithmetical  practice  to  enable 
him  to  become  accurate  in  his  work.  The  amount  of  such  prac- 
tice needed  varies  greatly  with  students,  and  enough  exercises 
are  included  to  meet  the  needs  of  those  requiring  most  drill. 

The  material  in  this  volume  will  be  found  sufficient  for  an 
ordinary  course  of  six  months,  but  it  may  be  condensed  for  a 
shorter  course  by  the  omission  of  certain  topics  and  chapters. 
For an introductory course in a normal school or college, Chapters
I to IX with selected topics from Chapters XII, XIII,
and  XIV  are  suggested.  In  case  a  second  course  is  offered,  the 
last  seven  or  eight  chapters  with  supplementary  reading  and 
term  papers  will  usually  be  ample. 

The  writer  is  greatly  indebted  to  Professor  Karl  Pearson, 
Dr.  Leonard  P.  Ayres,  and  Professor  Harold  Rugg  for  ideas 
acquired  while  he  was  under  their  instruction.  Valuable  advice 
and  suggestions  have  also  been  contributed  by  Dr.  Egon  Pear- 
son, Professor  C.  H.  Judd,  Professor  F.  N.  Freeman,  Professor 
E.  R.  Breslich,  Dr.  Douglas  Scates,  and  Dr.  Ralph  Hogan,  all 
of  whom  read  the  manuscript  while  in  preparation.  Additional 
thanks  are  due  to  Mr.  Lumir  Brazda  for  preparation  of  the 
diagrams  and  to  Mrs.  Bryan  Mitchell  for  assistance  in  check- 
ing the  proof. 

KARL  J.  HOLZINGER 

CHAPTER I

INTRODUCTION 

1. The Need for Statistical Method in Dealing with
Educational Problems

In  recent  years  the  scientific  movement  in  education  has  led 
to  the  wide  use  of  quantitative  methods.  Problems  in  school 
administration  and  in  educational  theory  and  practice  are  now 
being  studied  chiefly  by  the  application  of  experimental  and 
statistical  technique. 

The  increasing  demand  for  school  surveys  and  the  generous 
appropriations  made  by  the  various  foundations  to  promote 
these  and  other  financial  inquiries  have  created  a  need  for 
statistical  training  for  persons  conducting  such  investigations. 
Some  of  the  outstanding  problems  in  such  studies  are  the  ap- 
portionment of  school  funds,  school  accounting,  unit  costs,  and 
budgetary  control,  all  of  which  involve  careful  accumulation  of 
data  and  application  of  appropriate  statistical  method. 

Another  field  in  which  adequate  knowledge  of  statistics  has 
become  imperative  is  that  of  standardized  tests.  In  modern 
educational  science  the  old  types  of  personal  estimate  and  school 
examination  are  being  replaced  by  intelligence  tests  and  scales 
for  the  measuring  of  achievement  in  the  various  school  sub- 
jects. Statistical  methods  are  fundamental  in  the  theory  of 
test and scale construction and in the interpretation of the
results obtained from such tests.

In the selection and organization of test material and the
standardization  and  preparation  in  final  form,  elaborate  tech- 
nique is  often  required.  Modern  developments  in  test  construc- 
tion have  led  to  the  use  of  more  and  more  refined  methods,  so 
that  the  test-maker  of  today  needs  to  be  a  thorough  student  of 
statistics. 

In  the  application  of  standardized  tests  to  such  problems  as 
pupil  classification,  vocational  guidance,  diagnosis  of  special  abil- 
ities, and  evaluation  of  methods  of  instruction,  a  sound  knowl- 
edge of  statistical  method  is  imperative,  because  all  such  studies 
involve  the  collection  of  appropriate  data,  summarization  of 
the  results,  and  correct  inferences  from  the  statistical  findings. 

The  quantitative  trend  in  school  investigation  has  given  rise 
to  a  tremendous  bulk  of  literature.  There  are  now  hundreds 
of volumes on school surveys filled with tables and diagrams;
there are books, monographs, theses, and reports likewise replete
with statistics; there are scores of government, state, and
institutional  pamphlets;  there  are  hundreds  of  standardized 
tests;  and  there  is  an  ever-increasing  amount  of  periodical 
literature  reporting  the  findings  of  quantitative  studies. 

It  is  evident  that  if  the  school  administrators  and  teachers 
for  whom  a  large  part  of  this  great  body  of  literature  was 
written  are  to  understand  and  apply  it,  they  must  have  con- 
siderable familiarity  with  statistical  method.  It  is  impossible 
to  keep  up  with  the  most  recent  developments  in  school  research 
without  some  knowledge  of  the  methods  upon  which  such  in- 
vestigations are  based. 

Professional  schools  and  departments  in  universities  devoted 
to  the  training  of  teachers  and  administrators  are  meeting  the 
demand  by  courses  in  experimental  and  statistical  method. 
The  purpose  of  such  courses,  in  general,  is  to  give  the  student 
sufficient  information  for  intelligent  reading  of  the  present 
quantitative  literature,  and  to  furnish  him  with  the  technique 
necessary  for  carrying  on  his  own  investigations.  This  twofold 
aim  has  been  kept  in  mind  in  preparing  the  present  text. 

2. Some General Requirements for Success in the
Use  of  Statistical  Method 

In  conducting  a  statistical  study  the  investigator,  survey  ex- 
pert, or  classroom  teacher  should  have  in  mind  some  definite 
problem  or  purpose,  no  matter  how  limited  in  scope.  The 
mere  gathering  of  masses  of  data  or  the  haphazard  calcula- 
tion and  plotting  of  diagrams  are  of  little  value  unless  they 
can  be  brought  to  bear  upon  a  problem.  While  desirable  lines 
of  investigation  are  often  discovered  after  the  data  have  been 
collected  and  tabulated  in  a  tentative  way,  it  is  much  safer  to 
decide  upon  the  problem  first  and  then  proceed  to  collect  the 
data  necessary  for  its  solution.  The  selection  of  a  problem 
which  is  worth  while,  and  which  is  sufficiently  limited  so  that 
controls  may  be  made  and  all  necessary  details  carried  out 
thoroughly  and  completely,  is  perhaps  the  most  difficult  part 
of  the  whole  statistical  procedure.  It  requires  wide  knowledge 
of  the  general  field  in  which  the  problem  lies,  and  a  certain 
constructive  imagination  in  foreseeing  the  various  difficulties 
which  are  likely  to  arise. 

Another  requisite  for  a  good  statistical  investigation  is  ade- 
quate data.  No  matter  how  excellent  the  problem  or  the  plan 
of  procedure,  if  the  data  employed  are  scanty  the  results  will 
be  of  little  value.  Statistical  method  usually  involves  some 
generalization  based  upon  summaries  of  the  data.  If  the  data 
are  small  in  number,  therefore,  the  conclusions  drawn  will  not 
be  reliable.  This  may  be  illustrated  by  some  unpublished  ex- 
periments in  maze-learning  based  upon  about  twenty-five  cases. 
Out  of  eight  similar  studies  five  showed  a  superiority  for  one 
method  of  learning,  while  the  other  three  showed  a  difference  in 
favor  of  another  method.  In  all  the  experiments  the  number 
of  cases  was  so  small  that  none  of  the  differences  obtained 
proved  to  be  significant,  but  could  be  readily  accounted  for 
by  mere  chance  fluctuations  in  the  samples  of  data  chosen. 
While there is no fixed number of cases necessary for making
a statistical study, a desirable minimum for experimental work
is  about  fifty,  provided  they  are  well  chosen. 
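
The force of the maze-learning illustration may be seen from a little arithmetic. The sketch below is not the calculation made in those experiments; it assumes, purely for illustration, that the two methods are really equal, so that each of the eight studies is as likely to favor one method as the other, and it asks how often a division at least as uneven as five to three would then arise. The function name is hypothetical.

```python
from math import comb

def chance_of_split(n_studies: int, k_majority: int) -> float:
    """Probability that either method leads in at least k_majority of
    n_studies, assuming each study is an even toss-up (an assumption)."""
    if not n_studies / 2 < k_majority <= n_studies:
        raise ValueError("k_majority must be a strict majority of n_studies")
    # Count the outcomes in which one given method leads k or more times.
    favorable = sum(comb(n_studies, j) for j in range(k_majority, n_studies + 1))
    # Double it, since either of the two methods might be the one ahead.
    return 2 * favorable / 2 ** n_studies

p = chance_of_split(8, 5)
print(f"chance of a split at least as uneven as 5 to 3: {p:.2f}")
```

Under the stated assumption the chance is roughly three in four, so a five-to-three division among eight small studies gives no ground for preferring either method.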

Data  adequate  as  to  number  are  not  alone  sufficient  to  insure 
satisfactory  material.  The  facts  gathered  must  be  reliable  and 
pertinent  to  the  problem  in  hand.  Questionnaire  returns  often 
fail  in  this  respect  because  the  intelligent  replies  of  a  number  of 
persons  to  whom  the  blanks  are  sent  are  offset  by  careless  or 
random  answers  on  the  part  of  others.  Increasing  the  bulk  of 
such  data  is  not  likely  to  increase  its  reliability,  but  the  selecting 
of  even  a  smaller  number  of  persons  who  could  be  depended 
upon  to  give  careful  replies  would  yield  better  results.  Thus  if 
one  wished  to  discover  the  most  important  aims  in  the  teaching 
of  high-school  English,  returns  from  a  small  well-selected  group 
of  experienced  teachers  would  be  preferable  to  those  from  a 
much  larger  group  taken  at  random. 

It  frequently  happens  that  the  worker  loses  sight  of  the  fact 
that  his  data  are  inadequate  as  to  quantity  and  quality  and 
applies  elaborate  statistical  methods  with  the  expectation  that 
the  final  results  will  be  of  value.  Such  procedure,  if  followed 
intentionally, has been rightly described as "hiding behind a
statistical  smoke-screen,"  and  is  nothing  less  than  a  scientific 
crime.  The  limitations  of  the  data  employed  should  always  be 
frankly  recognized  and  the  conclusions  of  the  study  made  with 
them  in  mind.  No  amount  of  subsequent  juggling  by  compli- 
cated formulas  can  give  good  results  when  they  are  based  upon 
originally  faulty  data. 

The  successful  statistician  must  have  the  capacity  for  careful, 
painstaking,  and  scientifically  honest  work.  It  is  so  easy  to 
gather  a  few  figures  and  tabulate  them  in  such  a  way  as  to  show 
a desired result or "prove" a certain theory that the temptations
on the path of scientific rectitude are great. The untrained
reader  is  often  so  bewildered  by  tables  and  diagrams  that  he  is 
incapable  of  verifying  the  method  or  the  inferences  in  a  statis- 
tical article  and  either  accepts  the  conclusions  on  the  reputation 
of  the  writer  or  perhaps  concludes  that  "anything  can  be  proved 
by statistics." Educational science would be greatly improved
by  the  production  of  a  smaller  number  of  studies  based  upon 
better  data  and  a  more  cautious  use  of  statistical  method. 

A  final  requisite  for  the  successful  use  of  statistics  is  training 
in  methodology.  The  investigator  needs  to  become  familiar  with 
the  various  technical  methods  and  processes  of  calculation.  He 
needs  much  training  in  the  application  of  these  methods  to  data 
and  problems  in  the  particular  field  in  which  he  expects  to  work. 
He  also  needs  some  knowledge  of  the  difficult  field  of  statistical 
inference.  It  is  this  general  pedagogical  requirement  which  the 
textbook  and  course  in  statistics  are  expected  to  fulfill.  Such  a 
course  of  study  should  familiarize  the  student  with  methods 
appropriate  to  educational  problems,  insure  skill  in  statistical 
arithmetic,  and  provide  opportunity  for  working  out  a  worth- 
while problem  under  careful  guidance. 

3. General Statistical Procedure in Dealing
with a Problem

While  there  is  no  set  order  in  which  the  steps  in  a  statistical 
study  must  be  carried  out,  experience  has  shown  that  a  sys- 
tematic procedure  like  the  following  is  logical  and  economical 
of  time  and  labor.  Most  of  these  steps  will  be  discussed  and 
fully  illustrated  in  subsequent  chapters. 

(1)  Planning  of  the  study.  When  the  student  has  some  prob- 
lem selected,  his  first  concern  will  be  with  a  rough  plan  for  the 
whole  study.  It  may  not  be  possible  to  define  the  problem  very 
specifically  until  the  data  have  been  gathered  and  examined, 
but  the  more  definitely  the  limits  of  the  inquiry  can  be  set  in 
advance  the  easier  will  be  the  subsequent  steps.  The  usual 
mistake  is  to  select  a  problem  much  too  broad  and  too  difficult 
for  any  one  individual  or  even  a  small  group  of  workers  to 
undertake  effectively.  The  availability,  sources,  accuracy,  and 
methods  of  gathering  data  should  all  be  considered  in  the 
preliminary  plan. 

(2) Collection of the data. With the problem defined and a
general  plan  made,  the  next  step  is  to  collect  the  necessary 
data.  This  is  accomplished  by  the  use  of  questionnaires,  by 
personal  tabulation  from  data  already  available  in  records,  or 
by  the  application  of  standardized  tests,  rating  schemes,  and 
other  such  measuring  devices  (Chapter  II). 

(3)  Preliminary  analysis  of  the  data.  If  a  questionnaire  has 
been  used  in  collecting  the  material,  it  is  usually  necessary  to 
examine  the  returns  very  carefully  before  making  tabulations. 
Incompleteness,  inaccuracy,  and  ambiguity  in  the  answers  given 
should all be considered before the data are used. Similar analysis
is often necessary with the results of standardized scales;
unusual  test  conditions  and  errors  in  giving  and  in  scoring  the 
tests  need  to  be  checked  up  before  tabulation  is  begun. 

A  preliminary  analysis  of  the  material  will  also  be  desirable 
in  many  cases  to  determine  whether  or  not  the  data  are  ade- 
quate for  the  problem  in  hand.  It  may  be  that  question-blank 
returns  from  a  certain  source  are  too  scanty  or  fail  to  appear  in 
such  a  form  as  to  meet  the  requirements  of  the  problem.  In 
such  cases  a  revised  blank  and  more  data  will  be  required 
(Chapter  II). 

(4)  Tabulation  for  primary  records.  After  the  preliminary  anal- 
ysis the  data  should  be  tabulated  in  such  a  way  as  to  form  both 
a  permanent  and  a  convenient  working  record.  The  permanent 
record  may  be  kept  in  a  bound  volume  with  a  page  to  each  case 
or  in  the  form  of  a  master  sheet  with  the  names  and  records  in 
parallel  columns.  The  working  record  is  usually  in  the  form 
of  small  cards.  One  of  these  is  made  out  for  each  case  and  the 
data  entered  in  compact  form  so  that  the  cards  may  be  readily 
sorted  and  the  resulting  distributions  easily  checked  (section  7, 
Chapter  II). 

(5)  Classification  of  the  material.  Distributions,  tables,  and 
serial  arrangements  may  next  be  made  from  the  primary  record. 
These furnish the basis for calculations and graphical representations
of the material.

(6)  Analysis  of  the  classified  data  and  planning  of  the  calcula- 
tions. The  particular  statistical  calculations  to  be  employed 
are  often  not  apparent  until  the  data  have  been  arranged  in 
systematic  form.  The  choice  and  right  use  of  the  proper  ana- 
lytical methods  are  extremely  important,  and  at  this  point 
sound  statistical  judgment  is  required. 

After  the  required  calculations  have  been  decided  upon,  they 
should  be  planned  throughout  before  computation  is  begun. 
This  is  particularly  advisable  with  data  involving  correlations 
(Chapters  IX  and  X),  the  tables  for  which  may  be  checked 
against  one  another  and  also  used  to  furnish  other  statistical 
quantities  such  as  the  averages  and  measures  of  variability 
(Chapters  VI  and  VII). 

(7)  Calculation  of  the  statistical  constants.  The  computations 
required  may  be  made  with  the  assistance  of  calculating  tables 
and  machines.  It  is  desirable  to  have  complete  checks  on  the 
arithmetical  accuracy  of  the  work.  Some  of  these  are  afforded 
by  formulas,  but  the  best  check  is  to  have  two  persons  perform 
the  calculations  independently  (Chapter  V). 

(8)  Interpretation  of  results.  The  study  has  now  reached  the 
point  where  a  careful  scrutiny  of  results  is  required.  These  need 
to  be  interpreted  in  terms  of  the  problem  in  hand.  If  the  in- 
vestigator is  fortunate,  the  results  may  come  out  in  such  a  way 
that the conclusions to be drawn are clear-cut and unambiguous.
Very  frequently,  however,  the  findings  are  incomplete  or  in- 
conclusive, so  that  it  is  necessary  to  make  inferences  with 
extreme  caution.  Careful  application  of  the  methods  of  statis- 
tical inference  will  then  be  necessary  in  order  to  guard  against 
unwarranted  generalizations  (Chapter  XIII). 

(9)  Presentation  of  results  in  tables  and  diagrams.  Before 
writing  the  report  most  workers  will  find  it  desirable  to  prepare 
rough sketches of the tables and diagrams to be used in the
study.  It  is  often  convenient  to  cut  these  out  and  to  pin  them 
into  the  text  as  it  is  written  (Chapter  III). 

(10)  Writing  the  report.  A  satisfactory  report  will  usually 
parallel  in  a  general  way  the  steps  outlined  above.  It  will 
contain  a  statement  of  the  problem  and  its  setting  in  the  larger 
field;  a  description  of  the  group  studied;  an  account  of  the 
materials and methods employed; the results, inferences, and
conclusions of the study; and a summary of the results obtained.
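
Steps (4), (5), and (7) above may be sketched in miniature as follows. The pupils and scores are invented for illustration, and the tuple layout of each "card" is an assumption rather than a form prescribed in the text. The cards are sorted into a distribution, and the mean is then computed by two independent routes, in the spirit of having two persons perform the calculation separately.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical working records: one "card" per pupil (name, test score).
cards = [("A", 12), ("B", 15), ("C", 12), ("D", 18), ("E", 15), ("F", 15)]

# Step (5): sort the cards into a frequency distribution of scores.
distribution = Counter(score for _, score in cards)

# Step (7): compute the mean by two independent routes.
# Route 1: directly from the individual cards.
mean_from_cards = Fraction(sum(score for _, score in cards), len(cards))
# Route 2: from the frequency distribution (score times frequency).
mean_from_table = Fraction(
    sum(score * freq for score, freq in distribution.items()),
    sum(distribution.values()),
)

# The two routes must agree exactly; a mismatch signals an arithmetical slip.
assert mean_from_cards == mean_from_table, "arithmetic check failed"
print(dict(sorted(distribution.items())), float(mean_from_cards))
```

Exact fractions are used so that the comparison of the two results is not disturbed by rounding.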

With  this  general  plan  in  mind  we  may  next  turn  to  a  detailed 
account  of  the  various  statistical  methods. 


CHAPTER  II 

COLLECTION  AND  CLASSIFICATION  OF  DATA 

1.  Primary  and  Secondary  Data 

The  raw  material  employed  in  statistical  studies  consists  in 
measurements  or  estimates  known  as  data,  which  are  numerical 
statements  of  facts  in  any  department  of  inquiry,  such  as  astron- 
omy, economics,  biology,  psychology,  and  education.  In  the 
last  field  examples  are  furnished  by  the  scores  of  pupils  on 
standardized  tests,  physical  measurements  of  children,  salaries 
of  teachers,  attendance  records,  etc. 

Data  from  whatever  source  may  be  described  as  primary  or 
secondary.  These  terms  are  used  in  statistical  method  in  much 
the  same  way  as  in  historical  research.  In  the  latter  field  a  fact 
taken  from  an  ordinary  text  is  considered  as  secondary  material 
because  it  is  removed  at  least  one  step  from  the  original  record. 
If  the  information  were  secured  first-hand  from  documentary 
sources  such  as  laws,  original  proceedings,  letters,  etc.,  it  would 
be  considered  as  primary  historical  data. 

In  the  case  of  statistical  method,  primary  data  may  be  de- 
scribed as  those  secured  from  questionnaires,  measurements,  or 
estimates  before  the  material  has  been  combined  or  treated  in 
any  way  so  as  to  obscure  the  units  or  method  of  collection. 
Secondary  data,  on  the  other  hand,  are  those  which  have  al- 
ready been  collected  and  tabulated  in  some  form  available  for 
use.  They  are  usually  removed  one  or  more  steps  from  the  form 
of  the  original  record,  and  hence  comparison  with  similar  ma- 
terial is  of  doubtful  significance. 

If the problem were to determine the academic training of
teachers beyond four years of high school, primary records
might consist of returns from a large sampling of individual
teachers' replies to a question blank. Secondary data for such
a  problem  could  be  secured  from  the  reports  of  state  superin- 
tendents. The  latter  type  of  material  would  be  relatively  easy 
to  obtain,  but  would  be  open  to  the  objection  that  the  types 
of  teachers,  units  of  tabulation,  and  other  factors  might  not  be 
comparable  in  the  various  reports. 

Primary  data  are  of  course  much  to  be  preferred  to  secondary 
material.  In  case  the  study  is  of  wide  scope,  however,  and  re- 
quires an  elaborate  plan  for  securing  the  facts,  the  work  of 
collection  will  usually  be  too  much  for  a  single  individual.  Such 
studies  are  often  subsidized  by  grants  from  public  and  private 
funds  so  that  a  staff  of  trained  workers  may  gather  the  material. 
Assistance  of  this  sort  is  particularly  necessary  in  the  field  of 
school  costs,*  where  differences  in  methods  of  accounting  re- 
quire personal  tabulation  of  the  data  directly  from  the  school 
records  and  invoices. 

Studies  which  involve  primary  data  and  which  may  be  effec- 
tively handled  by  a  single  person  include  experiments  with 
apparatus  or  standardized  tests,  questionnaire  investigations  of 
limited  scope,  and  intensive  problems  where  the  method  of 
personal  estimate  or  observation  is  required. 

2.  Some  Examples  of  Secondary  Source  Material 

The  student  who  wishes  to  employ  secondary  material  will 
find  a  large  amount  in  government  reports,  school  surveys, 
foundation  publications,  and  funded  inquiries.  The  Federal 
sources  include  the  annual  and  sundry  reports  of  the  United 
States  Bureaus  of  Census,  Education,  and  Labor.  Dr.  Leonard 
P.  Ayres  made  extended  use  of  such  material  in  preparing  his 
volume, "An Index Number for State School Systems," from
Bureau of Education reports.† He was able to obtain figures on
school costs and attendance for all the states running back
over a period of fifty years.

* See N. B. Henry, A Study of Public School Costs in Illinois Cities. The Macmillan
Company, 1924. (This is one of the studies subsidized by the Commonwealth
Fund.)

† Leonard P. Ayres, An Index Number for State School Systems. Russell Sage
Foundation, 1920.

Reports from state superintendents and state departments
are  often  useful  in  making  preliminary  studies  of  a  type  re- 
ported by  William  R.  Burgess  on  the  academic  preparation  of 
teachers.*  Dr.  Burgess  summarized  the  reports  from  fourteen 
states  and  found  that  the  average  teacher  in  1918  had  only 
one  and  one-quarter  years  of  training  beyond  high  school. 

School  surveys  furnish  a  very  valuable  source  of  compara- 
tive data,  but  the  variations  in  the  methods  employed  for  secur- 
ing the  financial  and  test  data  make  extreme  caution  necessary 
in  using  such  facts.  The  volume  of  the  Educational  Finance 
Inquiry on "Financial Statistics," prepared by Miss Mabel
Newcomer,† is another example of a useful compilation for
comparative purposes. Similar studies may be found by consulting
the extensive bibliography on school costs prepared by
Dr. Carter Alexander.‡

3.  Units  of  Collection 

In  gathering  statistical  data  it  is  usually  necessary  to  decide 
in  advance  upon  the  units  to  be  employed.  For  Dr.  Burgess's 
problem, cited above, the character, "teacher training beyond
high  school,"  might  have  been  expressed  in  a  variety  of  units 
such  as  semester  hours,  quarters,  semesters,  or  years.  In  deal- 
ing with  normal-school  and  college  training  it  seemed  advisable 
to  him  to  consider  a  year  of  training  as  the  unit  no  matter 
where  taken.  The  choice  of  such  a  crude  unit  of  course  makes 
fine  comparisons  of  doubtful  significance.  Two  years  of  train- 
ing in  a  very  poor  normal  school  are  not  equivalent  to  two 
years  in  a  first-class  institution.  The  decision  as  to  how  rough 
a unit may be employed will depend largely upon the purpose
of the study. Dr. Burgess was interested not in individual differences
in teacher training but in securing an approximate
index of the amount of such training in a whole state. For such
purposes the unit employed was a very reasonable one.

* W. R. Burgess, "The Education of Teachers in Fourteen States," Journal of
Educational Research, March, 1921.

† Publications of the Educational Finance Inquiry, Vol. VI. The Macmillan
Company, 1924.

‡ Volume IV of the Educational Finance Inquiry. The Macmillan Company, 1924.

With  test  data  the  units  of  collection  are  given  by  the  tests 
themselves  in  terms  of  points,  years  of  mental  or  educational 
age,  or  as  functions  of  group  variability  (see  Chapter  VII).  In 
recent  years  it  has  been  discovered  that  the  units  in  many 
earlier  scales  were  expressed  to  a  fictitious  degree  of  accuracy. 
Problems in a "scaled" series were assigned values such as 3.24
under the assumption that abilities could be measured with an
accuracy of one-hundredth of a "probable error" unit, as it is
called. The instability of mental characters makes such precision
unwarranted. Another measurement of the same person
would probably differ from the previous one by a whole unit of
"probable error." For most test material the simple unweighted
item  furnishes  a  unit  which  is  sufficiently  accurate  for  all  statis- 
tical purposes,  although  derived  scores  such  as  mental  or  educa- 
tional ages  are  often  very  convenient. 

In  the  case  of  stable  characters,  such  as  height,  greater  care  is 
needed  in  determining  the  unit  of  measurement  to  be  employed. 
The  classification  of  the  data  and  the  comparisons  which  follow 
will  depend  upon  the  degree  of  accuracy  in  the  original  material. 
It  is  usually  best  to  make  the  measurements  somewhat  finer 
than  the  unit  to  be  employed  later  in  grouping  the  data.  Thus 
if  heights  are  to  be  classified  in  one-inch  intervals  the  measure- 
ments might  be  made  to  the  nearest  quarter  or  eighth  of  an 
inch.  This  insures  a  fairly  even  distribution  of  the  observations 
over  the  intervals  as  shown  in  section  8. 
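
The grouping just described may be sketched as follows. The heights are invented for illustration, and the convention that each interval includes its lower limit and excludes its upper limit is an assumption made here for definiteness; the function name is likewise hypothetical.

```python
from collections import Counter
from math import floor

# Hypothetical heights in inches, recorded to the nearest quarter inch.
heights = [60.25, 60.75, 61.0, 61.5, 61.75, 62.25, 62.5, 62.5, 63.0, 63.75]

WIDTH = 1.0  # grouping interval of one inch

def group_by_interval(values, width):
    """Tally values into intervals [lower, lower + width), keyed by lower limit."""
    return Counter(floor(v / width) * width for v in values)

table = group_by_interval(heights, WIDTH)
for lower in sorted(table):
    print(f"{lower:.0f} to {lower + WIDTH:.0f} in.: {table[lower]}")
```

Because the measurements are finer than the interval, no reading is forced onto a boundary by the coarseness of the ruler, and each case can be assigned to its interval without guesswork.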

4.  Types  of  Series 

A  statistical  series  may  be  described  as  a  set  of  data  the  items 
of  which  have  some  common  feature,  or  character.  Examples  of 
widely  different  characters  are  height,  intelligence,  and  religious 
denomination. When measurements of height, estimates of
intelligence, or verbal descriptions of religious denomination
are  made,  the  resulting  data  may  be  regarded  as  a  statistical 
series.  It  is  the  function  of  statistics  to  summarize,  compare, 
and  draw  inferences  from  such  series  relative  to  some  problem. 

In  order  that  appropriate  methods  may  be  used  with  different 
sorts  of  data,  it  will  be  desirable  to  have  a  classification  of  the 
various  types  of  statistical  series  which  may  arise.  The  basis 
of  this  classification  is  in  the  representation  of  the  character 
as  next  described. 

The  three  characters  cited  above  differ  in  certain  important 
respects. Height and intelligence are termed ordered characters,
because  the  amount  or  degree  of  either  trait  may  vary  in  an 
orderly  manner.  Religious  denomination,  on  the  other  hand, 
does  not  lend  itself  to  such  gradation,  but  must  be  described  in 
verbal  categories  not  related  to  one  another  in  any  orderly 
fashion; that is, it would be indifferent whether "Congregational"
be placed before or after "Methodist" in a classification.
Such  characters  may  be  called  unordered. 

A  second  difference  between  the  above  characters  arises  from 
the  manner  in  which  the  amounts,  degrees,  or  categories  are 
described,  or,  more  briefly,  according  to  the  mode  of  indexing 
of  the  character.  These  modes  of  indexing  may  be  numerical 
or  verbal.  Thus  height  is  said  to  be  numerically  indexed  when 
various  amounts  for  different  individuals  are  expressed  by  the 
numbers  arising  from  some  scale,  that  is,  by  measurements. 
Intelligence  may  be  indexed  numerically  or  verbally.  In  the 
former case mental ages or intelligence quotients could be employed,
while, for the latter mode of indexing, verbal descriptions,
as "inferior," "medium," and "high," might be used. The
limits  of  these  categories  may  or  may  not  be  stated  in  numerical 
terms  according  to  the  manner  in  which  the  data  are  gathered. 
As  already  noted,  the  three  categories  of  intelligence  can  be 
arranged  in  an  orderly  fashion,  but  in  the  case  of  religious  de- 
nomination the  verbal  categories  could  be  placed  in  any  order. 
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A  third  distinction  may  be  made  between  continuous  and  dis- 
continuous characters.  For  the  former  the  unit  of  measurement 
may  be  made  as  fine  as  we  please,  while  determinations  for  the 
latter  type  must  always  be  given  in  integers  (whole  numbers). 
Thus  height  is  a  continuous  character,  because  gradations  of 
height  vary  continuously ;  but  size  of  class  would  be  regarded  as  a 
discontinuous  character,  since  fractional  class  sizes  cannot  occur. 

The preceding analysis of characters may now be used to determine
a classification of statistical series, the various types
depending upon the ordering and indexing of the characters
involved. Thus when the character is ordered and numerically
indexed the resulting series will be called quantitative; when the
character is ordered and verbally indexed the series will be
termed qualitative; whereas for an unordered and verbally indexed
character the series will be designated as unordered.
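
The rule just stated can be put in the form of a short sketch. The class name `Character`, the function `series_type`, and the three sample characters are introduced here only for illustration and are not part of the original text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    """A trait described by its ordering and its mode of indexing."""
    name: str
    ordered: bool   # can the amounts or categories be graded?
    indexing: str   # "numerical" or "verbal"


def series_type(ch: Character) -> str:
    """Return the type of statistical series the character gives rise to."""
    if not ch.ordered:
        return "unordered"
    if ch.indexing == "numerical":
        return "quantitative"
    if ch.indexing == "verbal":
        return "qualitative"
    raise ValueError(f"unknown mode of indexing: {ch.indexing!r}")


# Illustrative characters taken from the discussion above.
examples = [
    Character("height in inches", ordered=True, indexing="numerical"),
    Character("intelligence rated inferior/medium/high", ordered=True, indexing="verbal"),
    Character("religious denomination", ordered=False, indexing="verbal"),
]

for ch in examples:
    print(f"{ch.name}: {series_type(ch)}")
```

The classification depends only on the two properties recorded for the character, which is the point made in the paragraph that follows.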

It should be noted that the basis of the present classification
is in the ordering and indexing of the character and not
in the nature of the trait itself. Thus both speed and quality
of handwriting may furnish either quantitative or qualitative
series because both may be either measured or verbally described.
An objection may be made to the use of the term
"qualitative" to describe series where the trait itself is not a
quality in the ordinary sense. This term, however, seems a
convenient one for verbal and ordered characters, and no confusion
need arise if it be understood that "qualitative" is merely
a label for such series.

Table 1 gives a brief summary of the above types of series.

Table 1. Classification of Series*

Ordering and Indexing           Resulting Statistical    Example
of the Character                Series

Ordered
  Indexed numerically           Quantitative
                                  Continuous             Test scores
                                  Discontinuous          Size of classes
  Indexed verbally              Qualitative              Estimates of intelligence

Unordered
  Indexed verbally              Unordered                Classification of religion,
  (possibly numerically)                                 race, occupation, etc.

* In his "Introduction to Statistics" Mr. G. U. Yule distinguishes between
statistics of attributes and statistics of variables. For the former the observer notes
only the presence or absence of some attribute and counts the number of individuals
who do or do not possess it; for the latter type, determinations of some variable
are made. Examples given for statistics of attributes are the number of blind and
not blind, sane and insane, or tall and short persons. Measurements of height or
weight furnish the data for statistics of variables.

The twofold classification given by attributes is rather restrictive and leaves open
to doubt the designation of many series which may arise. Thus if another group
"medium" be added to "tall" and "short," we are at a loss to know whether the
resulting series is to be classified under attributes or variables. Again, if we consider
the disabilities "blindness," "deafness," and "insanity" the same question arises.
It appears more satisfactory, therefore, to consider the above series given by height
as qualitative, since this character is ordered and verbally indexed, and to designate
the disability series as unordered on account of the indifferent arrangement of the
three classes.

5. Methods of Collecting Data

If secondary data are employed in a statistical study, it is
only necessary to assemble the material from the records in
some form convenient for use. Primary data may also be collected
by tabulation of existing records, but they are usually
gathered by enumeration or counting in such problems as school
census, by estimation as in experimental work and the appraisal
of teaching efficiency, by measurement with physical and mental
scales, or by questionnaires with inquiries of broad scope.

The particular method of collection to be employed will depend
upon the problem and the availability of the data. It is
usually best to avoid such indirect methods for securing data
as the questionnaire when it is to be filled in from printed
directions. The dangers of securing incomplete, unrepresentative,
and faulty information from such sources are very great.

In collecting data by enumeration, estimate, or measurement
it is desirable that the work be done by trained persons and by
uniform methods. For a problem in child accounting, knowledge
of the terms and methods in this field would be necessary
before comparable data could be collected from various sources.
In the appraisal of methods of classroom instruction a uniform
system and a careful technique of observation would be required,
while for problems involving the use of standardized
tests the service of trained workers in the administration of
such scales is usually needed. No elaborate statistical treatment
can correct the faults of poor original data, and anything that can
be done therefore to improve the reliability and accuracy of the
material is time well spent.

6.  Sampling 

By  the  sampling  process  is  meant  the  use  of  a  sample  or  por- 
tion of  a  larger  universe  of  material  taken  for  the  purpose  of 
drawing  conclusions  as  to  the  whole.  Thus  if  age  norms  are  to 
be  prepared  for  a  certain  test  it  is  clearly  impossible  to  examine 
all  children  of  the  required  ages.  It  is  therefore  necessary  to 
base  the  averages  upon  representative  samples  taken  from  the 
larger  universe  or  population.  If  the  samples  are  fairly  large 
and  properly  chosen,  the  results  will  not  only  be  very  close 
to  those  which  would  have  been  obtained  from  the  whole 
population,  but  it  is  also  possible  to  predict  from  the  sample 
the  range  within  which  the  true  value  will  very  probably  lie 
(see Chapter XIII). This makes it possible for the statistician
to generalize beyond his actual data, and to express the
so-called "reliability" of his result in terms of mathematical
probability.

The  principle  behind  the  sampling  process  is  that  a  fairly 
large  number  of  items  chosen  at  random  from  a  large  group  or 
population  is  very  likely  to  have  the  characteristics  of  the  whole 
population. This may be called the Law of Statistical Regularity
for Large Numbers.
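
The principle may be tried out by a small simulation. The sketch below is only an illustration: the assumed true share of 21 per cent, the sample sizes, the seed, and the function name `sample_percentages` are all choices made here, not data from the text.

```python
import random


def sample_percentages(true_share, sample_size, n_samples, seed=0):
    """Draw repeated random samples and return the percentage found in each.

    true_share  -- assumed proportion of the trait in the whole population
    sample_size -- number of items in each sample
    n_samples   -- number of samples drawn
    """
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    results = []
    for _ in range(n_samples):
        hits = sum(1 for _ in range(sample_size) if rng.random() < true_share)
        results.append(100.0 * hits / sample_size)
    return results


small = sample_percentages(0.21, 10, 23)     # very small samples
large = sample_percentages(0.21, 1000, 23)   # fairly large samples

print("spread of percentages, samples of 10:  ", max(small) - min(small))
print("spread of percentages, samples of 1000:", max(large) - min(large))
```

Under these assumptions the percentages from the large samples cluster closely about the population value, while those from the very small samples scatter widely.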

A  simple  illustration  of  this  law  is  furnished  by  an  experiment 
to determine the percentage of Ford cars appearing on a south-side
boulevard in Chicago. The results found were typical of
what might be considered a universe of Fords frequenting that
part  of  the  city.  The  method  adopted  was  to  count  100  passing 
cars  and  note  the  number  of  Fords  in  such  a  sample  with  the 
following  record : 

Table 2. Data on Ford Experiment

Percentage of Fords Observed       Number of Samples of 100 Cars
in Any One Evening's Sample        Each for a Given Percentage
of 100 Cars                        of Fords

            28                                  1
            27                                  -
            26                                  -
            25                                  2
            24                                  -
            23                                  2
            22                                  4
            21                                  5
            20                                  3
            19                                  2
            18                                  2
            17                                  2

          Total                                23

The  average  and  most  frequent  percentage  of  Fords  was  21, 
with  a  maximum  variation  in  the  other  samples  of  only  7  per 
cent.  Any  one  of  the  twenty-three  samples  then  gives  a  fairly 
good  indication  of  the  required  percentage.  It  must  be  kept 
in  mind,  however,  that  the  above  results  were  for  only  one 
section  of  Chicago  during  a  certain  time  of  the  day  and  for  only 
one  season  of  the  year.  Three  samples  taken  on  the  north  side 
two  months  later  gave  a  percentage  of  only  eight. 
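
The summary figures quoted for Table 2 can be checked with a few lines of arithmetic. The sketch below simply re-enters the table; the variable names are chosen here for illustration.

```python
# Table 2: percentage of Fords in a sample -> number of samples showing it.
ford_table = {28: 1, 25: 2, 23: 2, 22: 4, 21: 5, 20: 3, 19: 2, 18: 2, 17: 2}

n_samples = sum(ford_table.values())

# Weighted average of the percentages.
mean_pct = sum(pct * f for pct, f in ford_table.items()) / n_samples

# Most frequent percentage.
mode_pct = max(ford_table, key=ford_table.get)

# Largest departure of any sample from the most frequent percentage.
largest_gap = max(abs(pct - mode_pct) for pct in ford_table)

print(n_samples, round(mean_pct, 2), mode_pct, largest_gap)
# prints: 23 21.09 21 7
```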

Another  example  illustrating  the  law  of  regularity  in  sampling 
is  furnished  by  Mr.  Ben  Wood.*  Cards  for  6468  boys  were  filled 
with  information  regarding  guardianship,  number  of  children 
in  the  home,  and  other  similar  data.  By  putting  the  cards  in 
alphabetical  order  and  selecting  every  fourth  one  Mr.  Wood 
was  able  to  secure  a  sample  the  characteristics  of  which  were 
in  remarkably  close  agreement  with  those  for  the  whole  group. 

* Ben Wood, "The Reliability of Prediction of Proportions on the Basis of
Random Sampling," Journal of Educational Research, December, 1921.




A  modified  portion  of  his  tables  shows  that  a  quarter  of  the 
cards,  chosen  as  they  were,  was  an  adequate  sample  for  com- 
parative purposes. 

Table 3. Per Cent of Boys Living under Various Home Conditions

                                        Portion of 6468 Cards Used
Item                                One Fourth   Three Fourths    All

I. Guardian
     Father                            83.4          82.4         82.4
     Mother                            13.3          14.1         13.9
     Uncle                              0.6           0.6          0.6
     Aunt                               0.4           0.2          0.2
     Stepfather                         0.7           0.9          0.9
     Stepmother                         0.2           0.1          0.2

II. Number of children in family
     One                                6.0           6.3          6.3
     Two                               11.3          11.8         11.7
     Three                             14.8          13.7         13.9
     Four                              13.6          14.4         14.2
     Five                              14.3          14.6         14.5
     Six                               11.9          12.6         12.4
     Seven                              9.8          10.5         10.3

In  securing  a  random  sample  the  principle  to  be  kept  in  mind 
is  that  every  individual  in  the  group  should  have  the  same  (or 
nearly  the  same)  chance  of  being  included  in  the  sample.  This 
is  accomplished  in  several  ways.  One  plan  is  to  mix  the  data 
very  thoroughly  and  then  take  a  limited  portion  of  them.  This 
procedure  is  exemplified  in  the  shuffling  and  dealing  in  ordinary 
card-playing.  The  purpose  of  the  mixing  or  shuffling  is  to  pro- 
duce what  is  called  a  random  distribution,  a  portion  of  which 
furnishes  the  random  sample.  Such  distributions  are  assumed 
to  be  already  existent  in  many  problems,  such  as  that  of  the 
motor  cars,  where  the  arrangement  of  a  given  one  hundred  cars 
was  affected  by  many  chance  factors.  The  same  assumption  is 
made  in  measuring  rainfall.  Although  the  drops  fall  unevenly 
they  tend  to  moisten  given  areas  equally  in  the  long  run,  and 
hence  a  gauge  of  a  certain  area  furnishes  a  random  sample.  It 
is, of course, true that for large cities such as Chicago, samples
from different parts of the city need to be taken in order to get
an  adequate  measure  of  the  rainfall  for  the  whole  city.  On 
numerous  occasions  it  has  rained  in  one  part  of  the  city  and 
not  in  others  at  the  same  time. 

Good  results  may  often  be  secured  by  taking  the  items  at 
regular  intervals  after  the  material  has  been  arranged  in  some 
order.  In  Mr.  Wood's  experiment  every  fourth  card  was  se- 
lected under  alphabetical  arrangement.  This  plan  is  usually 
satisfactory  unless  there  is  some  reason  to  expect  a  relationship 
between  the  character  studied  and  alphabetical  order.  Thus,  in 
a  study  involving  pupil  recitation  there  might  be  a  tendency 
on  the  part  of  some  teachers  to  call  more  frequently  on  pupils 
whose  names  begin  with  the  earlier  letters  of  the  alphabet. 
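
Selection at regular intervals is easily sketched. In the example below the list `cards` is a hypothetical stand-in for an alphabetized file of 6468 cards, and the function name `systematic_sample` is introduced here for illustration.

```python
def systematic_sample(items, step, start=0):
    """Take every `step`-th item of an ordered list, beginning at index `start`."""
    if step < 1 or not 0 <= start < step:
        raise ValueError("need step >= 1 and 0 <= start < step")
    return items[start::step]


# Hypothetical card file: case numbers 1..6468, already in alphabetical order.
cards = list(range(1, 6469))

# Every fourth card, beginning with the fourth.
quarter = systematic_sample(cards, step=4, start=3)

print(len(quarter), quarter[:3])
# prints: 1617 [4, 8, 12]
```

The sample is only as good as the ordering allows; if the order is related to the character studied, as in the recitation example above, the regular interval will carry that relationship into the sample.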

If  the  population  sampled  contains  a  number  of  types,  a 
purely  random  sample  of  the  whole  is  probably  not  best  be- 
cause some  of  the  types  may  be  omitted  or  not  fairly  repre- 
sented. For  such  problems  sub-samples  proportional  in  size  to 
the numbers in the various types should be selected. For example,
in a study of high-school pupils samples from each of
the  four  years  might  be  chosen  and  combined,  the  size  of  the 
samples  being  taken  proportional  to  the  relative  numbers  of 
pupils  in  the  four  high-school  classes. 
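
A proportional division of the sample among the types might be computed as follows. The enrollment figures are hypothetical, and the largest-remainder rule used to make the parts add up exactly is a choice made for this sketch rather than a method given in the text.

```python
def proportional_allocation(group_sizes, total_sample):
    """Split `total_sample` among groups in proportion to their sizes.

    Whole numbers are assigned first; any places left over go to the
    groups with the largest fractional remainders.
    """
    population = sum(group_sizes.values())
    exact = {g: total_sample * n / population for g, n in group_sizes.items()}
    alloc = {g: int(x) for g, x in exact.items()}
    leftover = total_sample - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc


# Hypothetical enrollment in the four high-school years.
enrollment = {"first": 400, "second": 300, "third": 200, "fourth": 100}

print(proportional_allocation(enrollment, 60))
# prints: {'first': 24, 'second': 18, 'third': 12, 'fourth': 6}
```

Within each year the pupils would then be chosen at random, so that every type is represented in its proper weight.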

The  size  of  the  sample  will  depend  upon  the  degree  of  accu- 
racy required  in  the  result,  the  precision  varying  as  the  square 
root  of  the  number  of  cases.  As  indicated  in  the  first  chapter, 
forty  to  sixty  cases  are  as  few  as  can  be  expected  to  yield  good 
results  in  experimental  work.  When  only  fifteen  or  twenty  are 
used,  the  application  of  the  usual  laws  of  sampling  becomes 
very  doubtful. 
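
The square-root relation can be illustrated with the familiar formula for the standard error of a proportion, sqrt(p(1 - p)/n). The formula is brought in here as an assumption for the sake of the example (the text itself treats reliability in a later chapter), and the proportion 0.21 is hypothetical.

```python
import math


def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion p based on n random cases."""
    return math.sqrt(p * (1 - p) / n)


p = 0.21  # hypothetical proportion in the population
for n in (25, 100, 400):
    se = standard_error_of_proportion(p, n)
    print(f"n = {n:3d}   standard error = {100 * se:.2f} per cent")

# Quadrupling the number of cases halves the standard error.
ratio = standard_error_of_proportion(p, 100) / standard_error_of_proportion(p, 400)
print(ratio)
```

To double the precision, therefore, four times as many cases are needed.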

7.  Arrangement  of  the  Original  Data 

The  form  of  the  permanent  and  working  records  will  depend 
upon  the  number  of  data  employed.  The  master  sheet,  as  shown 
in  Exercise  1  at  the  end  of  this  chapter,  is  advisable  for  samples 
of fifty to one hundred cases. With a large amount of material,
however,  a  uniform  blank  card  or  page  is  usually  required.  Sir 
Francis  Galton  kept  a  record  of  his  data  in  large  bound  vol- 
umes with  a  page  for  each  person  examined,  the  age,  profession, 
nationality,  and  the  results  of  various  mental  and  physical  tests 
being  set  down  in  the  appropriate  spaces. 

If the series is short, the tallying and distributions may be
made directly from the master sheet by running down the page
and checking off the items. This method, however, makes it
necessary to go over the whole list to catch a single error and
is rather awkward for the preparation of correlation tables
because the order and accuracy of entry produce a strain on the
attention.

The preparation of small tickets for a working record will
overcome most of the above difficulties. These cards should be
fairly thin, of uniform size, and have a corner cut off to facilitate
the separation into piles during the sorting. The data are set
down  from  the  permanent  record  in  the  form  of  numbers  with 
a  definite  spatial  arrangement  on  the  card.  In  order  to  identify 
the  tickets  with  the  permanent  record  the  case  number  should 
appear  on  a  corner  of  the  card.  This  will  make  it  possible  to 
prepare  a  duplicate  in  case  a  card  is  lost.  The  size  of  the  card 
will  depend  upon  the  number  of  items  entered,  but  it  should  be 
as  small  as  can  be  handled  conveniently.  A  key  for  the  items 
entered  will  of  course  be  required  for  a  sample  card  such  as  the 
one  shown  in  Fig.  1.  In  sorting,  the  characters  are  easily  iden- 
tified by  their  position  on  the  card. 

In  case  the  work  of  tabulation  and  sorting  is  to  be  done  by 
mechanical  devices  such  as  the  Hollerith  Machine,  the  data 
card  will  be  a  convenient  record  when  punching  the  informa- 
tion for  the  tabulating  card.  The  holes  in  this  card  (Fig.  2) 
make  it  possible  to  sort  very  rapidly  by  electrical  contact. 


Fig. 1. Data Card

8. The Simple Frequency Distribution

In  dealing  with  a  large  body  of  data  it  is  necessary  to  classify 
the  material  in  some  compact  and  orderly  form  before  it  can  be 
effectively  analyzed.  The  frequency  distribution  is  the  most 
convenient  arrangement  for  the  material,  because  it  reveals 
some  of  the  most  important  properties  at  a  glance  and  makes 
all  of  the  calculations  very  much  easier  than  would  be  possible 
with  the  ungrouped  items.  A  simple  frequency  distribution  con- 
sists of  a  series  of  classes  of  the  character  and  a  set  of  correspond- 
ing frequencies.  In  the  case  of  a  quantitative  series  the  scale  is 
usually  divided  into  a  number  of  classes  of  equal  width,  for  ex- 
ample, 54.5  to  59.5,  59.5  to  64.5,  64.5  to  69.5,  etc.  The  number 
of  items  or  measures  (called  the  frequency)  occurring  in  each 
interval  is  then  determined  by  tallying.  For  qualitative  or 
unordered  series  the  classes  are  indicated  verbally  and  the 
frequencies  tabulated  as  in  the  first  case. 

The  ancient  method  of  tallying  is  to  record  the  frequencies  by 
strokes  until  four  have  been  made  and  then  to  make  a  cross 
stroke.    This  makes  it  easy  to  count  the  marks.    For  example : 


Class          Tally              Frequency

64.5-69.5      ||||/ |                 6
59.5-64.5      ||||/ ||||/ |          11
54.5-59.5      ||||/ ||                7


The  tally  marks,  of  course,  should  not  appear  in  the  final 
distribution. 

The  steps  in  making  a  frequency  distribution  for  a  quantita- 
tive series  consist  in  (1)  noting  the  range  of  the  data,  that  is, 
the  distance  between  smallest  and  largest  items ;  (2)  deciding 
upon  the  number  of  classes  into  which  the  material  is  to  be 
grouped ;  (3)  determining  the  numerical  limits  of  the  classes ; 
and  (4)  tallying  the  frequencies  in  the  appropriate  classes. 
Steps  (2)  and  (3)  are  important  because  all  of  the  subsequent 
calculations will be affected by the width and limits of the
classes. The data, when grouped, are considered to be either
concentrated  at  the  midpoints  of  the  intervals  or  spread  evenly 
over  them.  The  calculations  from  grouped  data  will  then  not 
agree  exactly  with  those  from  ungrouped  series  unless  the  width 
of  the  classes  is  equal  to  the  collection  unit.  If  the  grouping 
is  this  fine,  however,  the  classes  may  be  so  numerous  that  the 
advantage  of  employing  a  distribution  is  lost  and  the  frequencies 
are  likely  to  present  a  very  irregular  appearance,  not  typical  of 
the  continuous  gradation  expected  from  ordered  characters.  It 
is  therefore  better  to  use  a  wider  interval  smoothing  out  the 
accidental  irregularities,  probably  due  to  sampling,  and  making 
the  subsequent  calculation  easier  although  slightly  less  accurate. 
When  there  are  from  fifteen  to  twenty-five  classes  with  material 
consisting  of  one  hundred  or  more  items,  the  error  due  to  group- 
ing is  very  slight,  and  even  this  may  be  adjusted  by  certain 
corrections  (see  Chapter  XVI,  section  8). 

The  choice  of  class  limits  depends  upon  the  accuracy  of  the 
original  data.  If  the  measurements  are  very  fine  and  the  classes 
fairly  broad,  the  limits  of  the  classes  may  be  expressed  in  the 
form  55-59.99,  60-64.99,  65-69.99,  etc.  This  method  makes  it 
possible  to  assign  measurements  very  definitely  to  the  appro- 
priate classes,  since  all  items  equal  to  the  lower  limit  and  up  to 
but  not  including  the  upper  limit  are  located  in  a  given  class. 
One  difficulty  with  this  designation,  however,  is  that  confusion 
sometimes arises regarding the numerical value of the upper
class limits in calculation. Students may take these to be actually
59.99, 64.99, etc. An alternative plan is to write the classes
in the form 55-60⁻, 60-65⁻, 65-70⁻, etc., with the understanding
that 60⁻ is equal to 60 for purposes of calculation, but
means just less than 60 in the tabulation.

A more important objection to the above method arises when
the measurements are not very fine. If we assume that the
items are given correct to the nearest integer, an even distribution
of the observations over a class interval would be represented
as shown in Fig. 3.

Fig. 3. Illustrating the incorrect class limits which appear when the items are
measured to the nearest integer

The class values, or midpoints of the intervals used in subsequent
calculations, will thus be 57.5, 62.5, etc. These values,
however, are not representative of the items in the respective
classes. In the first interval, for example, there are three
items below 57.5 and only two above, spacing the observations
unequally about this class value.

In order to adjust for this difficulty, the class limits
should be set as in the diagram shown in Fig. 4 by moving one
half of the collection unit down on the scale. The location of any
frequency is as uniquely determined as before, and the class
values 57, 62, etc. are more truly representative of the items in
the intervals 54.5-59.5, 59.5-64.5, etc.

   Items           55  56  57  58  59      60  61  62  63  64
   Class limits    54.5  to  59.5          59.5  to  64.5
   Class value     57                      62

Fig. 4. Illustrating the correct method of stating class limits when the items are
given to the nearest integer

Fig. 5. Illustrating class values and limits with measurements
a, very fine; b, to nearest integer

If the measurements are so fine that practically the
whole interval 55-59.99 could be filled with observations, then
57.5 would be an approximately correct class value and the
limits 55-60⁻ satisfactory. If, on the other hand, the measurements
are rather coarse, the size of the collection unit needs
to be taken into account by moving the interval back one half
the amount of this unit. If this is not done, all class values
(and the resulting average for the whole series) will be one
half of the collection unit too large.
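
The size of this error may be verified on a small series. The sketch below uses the twenty-five integer scores listed in Exercise 5 at the end of the chapter, grouped in classes one unit wide; the function name `grouped_mean` is introduced here for illustration.

```python
from collections import Counter

# The twenty-five scores of Exercise 5, given to the nearest integer.
scores = [11, 12, 12, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 15,
          16, 16, 16, 16, 17, 17, 17, 18, 18, 19]


def grouped_mean(scores, lower_limit_of, width=1):
    """Mean computed from class midpoints.

    lower_limit_of -- rule giving the lower limit of the class holding a score
    width          -- width of every class
    """
    freq = Counter(lower_limit_of(s) for s in scores)
    return sum((lo + width / 2) * f for lo, f in freq.items()) / len(scores)


true_mean = sum(scores) / len(scores)

# Classes 10.5-11.5, 11.5-12.5, ... (limits moved down half a unit).
shifted = grouped_mean(scores, lambda s: s - 0.5)

# Classes 11-11.99, 12-12.99, ... (limits left at the integers).
unshifted = grouped_mean(scores, lambda s: s)

print(true_mean, shifted, unshifted)
# prints: 15.0 15.0 15.5
```

With the limits left at the integers every class value, and hence the average, is one half unit too large, as stated above.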

The  following  frequency  distribution  has  been  made  from  the 
Otis  Test  Scores  appearing  in  Exercise  1  at  the  end  of  this 
chapter.  Inasmuch  as  the  scores  are  given  to  the  nearest  inte- 
ger (point),  the  classes  will  run  from  79.5-89.5,  89.5-99.5,  etc. 

Table 4. Frequency Distribution for Otis Test Scores

Class            Tally              Frequency

179.5-189.5      |                       1
169.5-179.5      |                       1
159.5-169.5      ||||                    4
149.5-159.5      ||||/ ||||/ |          11
139.5-149.5      ||||/ ||||              9
129.5-139.5      ||||/ ||||/ |          11
119.5-129.5      ||||/                   5
109.5-119.5      ||||                    4
 99.5-109.5      ||                      2
 89.5-99.5       |                       1
 79.5-89.5       |                       1

                 Total                  50

It  might  be  argued  that  a  person  receiving  a  score  of  80  could 
not  have  done  less  than  the  amount  required  to  receive  such  a 
score,  and  that  he  very  probably  did  a  little  more,  so  that  his 
truer  score  for  that  performance  should  be  80.5  instead  of  80. 
This reasoning would lead to class intervals, 80-90⁻, 90-100⁻,
etc.,  but  would  be  contrary  to  the  usual  practice  of  taking  scores 
at  their  face  value.  In  the  present  discussion,  therefore,  we 
shall  assume  that  scores  are  correct  to  the  nearest  integer. 

It  will  be  noted  that  only  eleven  classes  were  used  in  this 
distribution  because  of  the  small  number  of  cases  involved. 
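
The tallying that produces Table 4 may be sketched as follows. The fifty Otis scores are copied from the table in Exercise 1; the function name `frequency_distribution` and its arguments are introduced here for illustration.

```python
import math
from collections import Counter

# Otis scores of the fifty pupils, in the order of the master sheet.
otis = [171, 169, 128, 141, 106, 146, 87, 114, 187, 133, 151, 131, 150,
        118, 142, 166, 158, 101, 159, 126, 136, 137, 152, 137, 132,
        133, 151, 145, 152, 157, 144, 140, 111, 150, 152, 137, 146,
        128, 145, 153, 149, 114, 135, 131, 161, 95, 134, 124, 125, 167]


def frequency_distribution(scores, lowest_limit, width):
    """Tally scores into classes of equal width, highest class first.

    Returns a list of (lower limit, upper limit, frequency) rows.
    """
    if min(scores) < lowest_limit:
        raise ValueError("lowest_limit must not exceed the smallest score")
    # Class number 0 begins at lowest_limit, class 1 one width higher, etc.
    counts = Counter(math.floor((s - lowest_limit) / width) for s in scores)
    top = max(counts)
    return [(lowest_limit + k * width, lowest_limit + (k + 1) * width, counts.get(k, 0))
            for k in range(top, -1, -1)]


table = frequency_distribution(otis, lowest_limit=79.5, width=10)
for lower, upper, f in table:
    print(f"{lower:5.1f}-{upper:5.1f}  {f:2d}")
```

Because the limits fall at half units and the scores are integers, no score can lie on a class boundary, and each is located in exactly one class.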


9.  The  Classifier 

In  dealing  with  small  samples  it  is  frequently  desirable  to 
rank  the  items  and  to  prepare  short  frequency  distributions. 
For such purposes a device known as the classifier will be found
very convenient. It consists of tabular arrays of small cells
identified  by  units'  digits  on  one  axis,  and  by  tens'  digits  on  the 
other.  The  location  of  any  item  is  then  readily  indicated  by  a 
tally  mark  in  the  appropriate  cell.  The  accompanying  classifier 
has  been  made  for  the  Otis  scores  given  in  Exercise  1,  p.  29. 

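Table 5. Classifier for Otis Test Scores*

                                                  Units
Tens   | 0       | 1       | 2       | 3       | 4       | 5       | 6       | 7       | 8       | 9       | Totals

  18   |         |         |         |         |         |         |         | / 1     |         |         |    1
  17   |         | / 2     |         |         |         |         |         |         |         |         |    1
  16   |         | / 6     |         |         |         |         | / 5     | / 4     |         | / 3     |    4
  15   | // 16.5 | // 14.5 | /// 12  | / 10    |         |         |         | / 9     | / 8     | / 7     |   11
  14   | / 26    | / 25    | / 24    |         | / 23    | // 21.5 | // 19.5 |         |         | / 18    |    9
  13   |         | // 36.5 | / 35    | // 33.5 | / 32    | / 31    | / 30    | /// 28  |         |         |   11
  12   |         |         |         |         | / 42    | / 41    | / 40    |         | // 38.5 |         |    5
  11   |         | / 46    |         |         | // 44.5 |         |         |         | / 43    |         |    4
  10   |         | / 48    |         |         |         |         | / 47    |         |         |         |    2
   9   |         |         |         |         |         | / 49    |         |         |         |         |    1
   8   |         |         |         |         |         |         |         | / 50    |         |         |    1

Totals | 3       | 9       | 5       | 3       | 5       | 5       | 6       | 7       | 4       | 3       |   50

Md. (the median) falls between the tallies ranked 25 and 26.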

*  This  useful  device  was  first  brought  to  the  attention  of  the  writer  by  Dr. 
Leonard  P.  Ayres  in  a  series  of  lectures  given  at  The  University  of  Chicago  in  1920. 
It is recommended for use when dealing with fifty to one hundred cases.

The items are entered in the classifier just as they come from the
master sheet. Thus the first Otis score from the list is 171.
Looking in the left-hand margin for the tens' digit 17, and
under the units' digit 1, locates the tally in the
cell. To check the work the tallying may be repeated
by making a small dot over each tally stroke.

It will be noted that the material has been arranged in classes
ten units in width and that the distribution in the totals on the
right is the same as that found in section 8. The distribution of
the totals at the bottom of the classifier is a random arrangement,
the number of scores ending in 0, 1, 2, 3, etc. tending to
be the same in the long run.

The numbers in the cells indicate the ranks of the various
items. These are determined after all of the tallying is completed,
by counting down from the highest score. The advantage of the
classifier for ranking is that if the tallying is correct, none of the
scores will be omitted as would be quite likely if they were
arranged in rank order by searching in the list of fifty for
successively smaller items.

When a score of 152 has been reached in the ranking, three
tallies will be found. Inasmuch as these have the same value
it is customary to assign to each the average rank of 11, 12,
and 13, which is 12. In the same way the two scores of 151
would share the next two ranks 14 and 15, each being given the
average rank of 14.5.

In addition to the grouping and ranking of the data the classifier
will also be found useful in determining the median. This
average for ranked items is the middle score, or is halfway
between the two middle scores, for an even number of items. In
the problem above the median is 140.5 by inspection.
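
The entries of the classifier, the averaged ranks, and the median may be checked by the following sketch. The scores are the fifty Otis scores of Exercise 1; the function name `average_ranks` is introduced here for illustration, and the median is taken with the standard library's `statistics.median`, which follows the same halfway convention for an even number of items.

```python
from collections import Counter
from statistics import median

# Otis scores of the fifty pupils, in the order of the master sheet.
otis = [171, 169, 128, 141, 106, 146, 87, 114, 187, 133, 151, 131, 150,
        118, 142, 166, 158, 101, 159, 126, 136, 137, 152, 137, 132,
        133, 151, 145, 152, 157, 144, 140, 111, 150, 152, 137, 146,
        128, 145, 153, 149, 114, 135, 131, 161, 95, 134, 124, 125, 167]


def average_ranks(scores):
    """Rank from the highest score down; tied scores share the average rank."""
    counts = Counter(scores)
    ranks, next_rank = {}, 1
    for value in sorted(counts, reverse=True):
        ties = counts[value]
        # The ranks next_rank .. next_rank + ties - 1 are shared equally.
        ranks[value] = next_rank + (ties - 1) / 2
        next_rank += ties
    return ranks


# Classifier cells: (tens' digits, units' digit) -> number of tallies.
cells = Counter(divmod(s, 10) for s in otis)

ranks = average_ranks(otis)
print(cells[(15, 2)], ranks[152], ranks[151], median(otis))
# prints: 3 12.0 14.5 140.5
```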

It will be noted that the median is here defined as the middle
score. For an odd number of cases this definition offers no
difficulty, but with an even number the use of the value halfway
between the two middle scores is a convention supplementary
to the definition.

10. Cumulative Frequency Distributions

It is often useful to have the data arranged in a cumulative
rather than a simple frequency distribution. This is accomplished
by tabulating all of the frequencies less than the upper
limit of each class interval. For the Otis material the cumulative
distribution would be as follows:

Table 6. Cumulative Frequency Distribution for the Otis Test Data

Score Less Than       Cumulative Frequency

    189.5                     50
    179.5                     49
    169.5                     48
    159.5                     44
    149.5                     33
    139.5                     24
    129.5                     13
    119.5                      8
    109.5                      4
     99.5                      2
     89.5                      1

The cumulative frequencies are of course easily tabulated after
the simple frequency distribution has been made. Both methods
of representing series will be extensively used in applying
the descriptive methods of the following chapters.
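
The passage from the simple to the cumulative distribution is a running sum, as the following sketch shows. The frequencies are those of Table 4, entered here from the lowest class upward; the variable names are chosen for illustration.

```python
from itertools import accumulate

# Upper class limits and simple frequencies from Table 4, lowest class first.
upper_limits = [89.5, 99.5, 109.5, 119.5, 129.5, 139.5,
                149.5, 159.5, 169.5, 179.5, 189.5]
frequencies = [1, 1, 2, 4, 5, 11, 9, 11, 4, 1, 1]

# Each cumulative frequency counts all scores below the given upper limit.
cumulative = list(accumulate(frequencies))

for limit, cf in zip(upper_limits, cumulative):
    print(f"less than {limit:5.1f}: {cf:2d}")
```

The last cumulative frequency must equal the total number of cases, which furnishes a convenient check on the tabulation.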


EXERCISES 

1. Make a classifier for the fifty Terman scores in the table on
page 29 and obtain the ranks of the scores. Determine the median
from these ranks. (123 or 122.7. Ans.)

2.  Work  out  a  scheme  for  ranking  the  Chicago  scores  and  obtain 
the  median.  (53.25.   Ans.) 


3. Make a simple frequency distribution for the Terman scores
from the classifier. The classes may be 169.5-179.5, 159.5-169.5,
etc. Make a similar distribution for the Chicago scores with
74.75-79.75, 69.75-74.75, etc.

Scores of Fifty Pupils on Three Intelligence Tests

Pupil    Otis    Chicago    Terman        Pupil    Otis    Chicago    Terman

   1     171      52         117            26     133      47         101
   2     169      75.5       153            27     151      53.5       137
   3     128      50.5       131            28     145      56.5       119
   4     141      46         105            29     152      56         124
   5     106      39.5        71            30     157      66.5       170
   6     146      55         130            31     144      60.5       155
   7      87      34          80            32     140      60.5       119
   8     114      42         101            33     111      38.5       142
   9     187      70         153            34     150      63.5       140
  10     133      51.5       132            35     152      65.5       122
  11     151      59         136            36     137      48         115
  12     131      52.5       128            37     146      54         125
  13     150      63         145            38     128      44.5        87
  14     118      44.5       110            39     145      57.5       120
  15     142      65.5       122            40     153      50.5       117
  16     166      61         152            41     149      53         135
  17     158      55         157            42     114      45.5       100
  18     101      39          88            43     135      40         125
  19     159      57.5       156            44     131      47         120
  20     126      41.5        92            45     161      61         149
  21     136      65.5       115            46      95      37          87
  22     137      63.5       109            47     134      50         103
  23     152      75.5       151            48     124      48.5       119
  24     137      45         132            49     125      43          95
  25     132      61.5       130            50     167      58.5       178

4.  Arrange  separately  the  Terman  and  Chicago  scores  in  the  form 
of  cumulative  frequency  distributions. 

5. Make a frequency distribution for the following scores, using an
interval of one unit: 11, 12, 12, 13, 13, 13, 14, 14, 14, 14, 15, 15,
15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 19. Calculate the average
(mean). What will the average be if the intervals are taken 11-11.99,
etc., instead of 10.5-11.5, etc.? What is the error in the average by
the former tabulation method? (Error is .5.)

6. Tabulate separately the scores on page 30 on spelling tests A
and B for 125 pupils, using an interval of 5.

7. Retabulate the scores in Exercise 6, using an interval of 10.
Which interval is better?

8. Make cumulative frequency distributions from the two spelling
test distributions of Exercises 6 and 7.

Scores of 125 Pupils on Two Spelling Tests of Equal Difficulty
(Maximum Score = 105)

   Test     |   Test     |   Test     |   Test     |   Test
   A     B  |   A     B  |   A     B  |   A     B  |   A     B
  43    44  |  91    84  |  57    59  |  45    37  |  65    68
  38    41  |  56    55  |  82    87  |  52    49  |  86    87
  83    87  |  66    61  |  53    61  |  57    69  |  25    25
  57    65  |  15    18  |  64    68  |  91    96  |  48    57
  63    70  |  47    62  |  64    62  |  84    78  |  71    73

  92    89  |  15    21  |  73    66  |  89    89  |  64    62
  68    63  |  63    70  |  74    76  |  81    76  |  48    57
  94    91  |  44    39  |  54    58  |  90    92  |  74    74
  79    84  | 100    95  |  68    77  |  70    76  |  86    93
  45    45  |  68    73  |  29    35  |  87    91  |  33    29

  79    81  |  26    30  |  43    53  |  83    79  |  42    44
  65    70  |  83    79  |  54    56  |  45    48  |  92    97
  20    33  | 102   101  |  85    87  |  93    89  |  83    88
  93    91  |  83    70  |  19    26  |  82    74  |  59    64
  67    72  |  85    85  |  81    80  |  55    57  |  63    62

  59    61  |  21    27  |  86    83  |  37    41  |  56    48
  98    99  |  81    76  |  74    77  |  31    26  |   6     6
  81    84  |  51    46  |  68    75  |  39    42  |  27    37
  86    79  |  49    52  |  69    62  |  16    16  |  41    52
  57    71  |  67    61  |  58    54  |  95    96  |  25    31

  92    82  |  86    80  |  85    85  |  68    71  |  55    59
  30    35  |  37    35  |  40    36  |  48    56  |  52    55
  80    79  |  43    49  |  79    90  |  38    49  |  68    77
  68    72  |  46    47  |  75    80  |  63    68  |  25    22
  83    75  |  85    83  |  63    55  |  80    86  |  53    59

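
The effect asked about in Exercise 5 can be checked numerically. The following is only a sketch: the names `scores`, `freq`, and `grouped_mean` are illustrative, and the midpoint of the class 11-11.99 is taken as 11.5, which is an assumption of this sketch.

```python
from collections import Counter

# The twenty-five scores of Exercise 5.
scores = [11, 12, 12, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 15,
          16, 16, 16, 16, 17, 17, 17, 18, 18, 19]

# Frequency distribution with an interval of one unit.
freq = Counter(scores)


def grouped_mean(freq, midpoint_offset):
    """Mean computed from the frequency table, treating every score in a
    class as if it lay at (score + midpoint_offset)."""
    n = sum(freq.values())
    return sum((score + midpoint_offset) * f for score, f in freq.items()) / n


mean_centered = grouped_mean(freq, 0.0)  # classes 10.5-11.5, midpoint 11
mean_shifted = grouped_mean(freq, 0.5)   # classes 11-11.99, midpoint taken as 11.5

for score in sorted(freq):
    print(f"{score:>3}  {freq[score]:>2}  {'*' * freq[score]}")
print("mean with classes 10.5-11.5:", mean_centered)
print("mean with classes 11-11.99: ", mean_shifted)
print("difference:", mean_shifted - mean_centered)
```

Since the scores are whole numbers, the class 10.5-11.5 has the score itself as its midpoint, while the other tabulation moves every midpoint up by half a unit.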

CHAPTER  III 

TABULAR  AND  GRAPHICAL  PRESENTATION  OF  DATA 

1.  Purpose  of  Tables  and  Diagrams 

Although the preparation of tables and diagrams will usually
be the last step in working out a statistical problem, it is well
to consider such work at this point because of its relative
simplicity and concreteness. For many elementary studies, moreover,
such as school and publicity reports, the tabulation and
graphical representation of secondary material is about the
only statistical method required. It is, therefore, desirable that
everyone dealing with educational statistics should become
acquainted as soon as possible with simple tables and graphs.

In the following discussion the word "diagram" is used to
describe all sorts of graphs, charts, plots, or maps used for the
display or comparison of data.

Tables and diagrams have a twofold purpose: one is to assist
in the analysis of the material and simplify the calculations by
representing the data in concise and orderly fashion, while the
other is to summarize and make clear the findings of a study.
Thus the chief reason for arranging material in a frequency
table is to facilitate analysis and calculation. The important
characteristics of the series may then be readily determined and
the required calculations made more easily than from the
ungrouped data. On the graphical side a method of calculation
has been developed known as nomography. By means of
curves drawn to suitable scales a great many statistical
calculations may be made very quickly. In many cases, however,
the construction of the nomograph is very laborious and the
desired calculations will not be given to a sufficient number of
significant figures. With the modern development of calculating
machines and statistical tables for computation, almost
every sort of calculation will be found to be easier, more rapid,
and much more accurate by numerical rather than by graphical
methods.

The proper use of tables to summarize numerical results is
important because the success of a statistical study may depend
a great deal upon the skill with which the tabular material is
arranged. Good tables are usually brief and so titled as to be
self-explanatory. By a suitable arrangement of headings a large
amount of important information can be given in a very short
space, comparison between similar items facilitated, and
visualization of group relationships made possible.

Graphs  or  diagrams  for  presentation  are  intended  to  make 
the  numerical  comparisons  clearer  and  more  vivid.  They  are 
not  primarily  intended  to  summarize  the  statistical  findings, 
which  should  appear  in  tabular  form  accompanying  the  diagram. 
If  too  many  details  are  given  in  a  chart  its  clarifying  value  is 
lost  and  a  diagram  that  is  not  clear  is  probably  not  worth 
making  at  all. 

2.  The  Construction  of  Tables  for  Presentation 

While  there  is  not  universal  agreement  as  to  the  terms  used 
and  the  best  form  for  a  table,  the  following  suggestions  have 
the  merit  of  successful  usage  in  the  publications  of  the  Russell 
Sage  Foundation. 

DEFINITIONS  OF  THE  PARTS  OF  A  STATISTICAL  TABLE 

1.  A  statistical  table  is  a  quantitative  presentation  of  facts  by 
means  of  numbers  arranged  in  a  column  or  columns  and  distributed 
according  to  one  or  more  groupings  of  the  subject  matter. 

2.  A  table  title  is  a  statement  appearing  at  the  head  of  a  statistical 
table,  showing  the  subject  with  which  the  table  deals. 

3. A column in a statistical table is the series of numbers,
generally relating to the same unit, arranged vertically in the table.

4. A line in a statistical table is a series of numbers arranged in
a horizontal row in the table.

5. The body of a table is the aggregate of the columns and the lines.

6.  A  column  heading  is  a  word  or  group  of  words  at  the  head  of  a 
column  of  numbers  in  a  table,  showing  the  unit  dealt  with  and  the 
relation  of  the  column  to  the  classification  followed. 

7. A brace heading, or box heading, is a word or group of words
appearing above two or more columns of numbers in a table, which
it has the effect of uniting as with a brace, and to each of which it
bears the same relation. In connection with the column headings,
the brace heading shows the unit dealt with and the relation of
each column to the plan of classification followed.

8.  A  line  title  is  a  word  or  group  of  words  at  the  left  of  a  horizontal 
line  or  row  of  figures  in  a  table,  showing  the  relation  of  the  line  to 
the  plan  of  classification  followed. 

9.  A  total  is  a  statement  of  the  aggregate  of  two  or  more  numbers 
appearing  in  a  column  or  line. 

10.  A  grand  total  is  a  statement  of  the  aggregate  of  several  totals. 

Table 7. Expenditure per Inhabitant for Operation and Maintenance
of Schools in Cleveland, and in Seventeen Other Cities of
from 250,000 to 750,000 Inhabitants, 1914

                 Estimated      Expenditure for Operation     Rank in
                 Population     and Maintenance               Expenditure
                 in 1914 (in    Total (in      Per            per
City             Thousands)     Thousands)     Inhabitant     Inhabitant
-------------------------------------------------------------------------
Baltimore           580           $1955          $3.37           17
Boston              734            5517           7.52            2
Buffalo             454            2450           5.40           12
Cleveland           639            3570           5.59            8.5
Detroit             538            2553           4.75           14
Indianapolis        259            1410           5.44           11
Jersey City         294            1421           4.83           13
Kansas City         282            1761           6.24            7
Los Angeles         439            3707           8.44            1
Milwaukee           417            1795           4.30           15
Minneapolis         343            2148           6.26            6
Newark              389            2699           6.94            3
New Orleans         361            1098           3.04           18
Pittsburgh          565            3602           6.38            5
San Francisco       449            1879           4.18           16
Seattle             313            1751           5.59            8.5
St. Louis           735            4085           5.56           10
Washington          353            2392           6.78            4
-------------------------------------------------------------------------
Average              --              --          $5.59           --

The model table above (Table 7) illustrates all of the terms used
with the exception of the totals, which were not necessary. It
will be noted that the basic information upon which the
comparisons are made is given in the table so that it could be
verified by the reader in case of doubt. The style notes used in the
construction of the table are given in the following list:
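
Before turning to the style notes, the verification mentioned above may be sketched in a few lines. The population and expenditure figures are those of Table 7. The names `TABLE_7` and `average_ranks` are illustrative. Treating tied values as sharing the average of the ranks they occupy is an assumption of this sketch, suggested by the two ranks of 8.5 in the table.

```python
# (city, population in thousands, expenditure in thousands of dollars), from Table 7
TABLE_7 = [
    ("Baltimore", 580, 1955), ("Boston", 734, 5517), ("Buffalo", 454, 2450),
    ("Cleveland", 639, 3570), ("Detroit", 538, 2553), ("Indianapolis", 259, 1410),
    ("Jersey City", 294, 1421), ("Kansas City", 282, 1761), ("Los Angeles", 439, 3707),
    ("Milwaukee", 417, 1795), ("Minneapolis", 343, 2148), ("Newark", 389, 2699),
    ("New Orleans", 361, 1098), ("Pittsburgh", 565, 3602), ("San Francisco", 449, 1879),
    ("Seattle", 313, 1751), ("St. Louis", 735, 4085), ("Washington", 353, 2392),
]


def average_ranks(values):
    """Rank values from highest (rank 1) to lowest, giving tied values
    the average of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first position held by v
        count = ordered.count(v)            # number of items tied at v
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


# Expenditure per inhabitant, to the nearest cent.
per_inhabitant = [round(total / population, 2) for _, population, total in TABLE_7]
ranks = average_ranks(per_inhabitant)
average = round(sum(per_inhabitant) / len(per_inhabitant), 2)

for (city, _, _), value, rank in zip(TABLE_7, per_inhabitant, ranks):
    print(f"{city:<14} ${value:5.2f}   rank {rank:g}")
print(f"{'Average':<14} ${average:5.2f}")
```

Computed this way, the per-inhabitant figures, the ranks, and the average agree with the entries printed in the table.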


STYLE  NOTES  FOR  MAKING  TABLES 

I.  Arrangement  of  Data 

1.  A  short  table  is  clearer  and  more  forceful  than  a  long  one. 

2.  Original  data  should  be  presented  in  full. 

3.  It  is  easier  to  compare  numbers  arranged  one  above  the  other 
than  numbers  placed  side  by  side.  Tables  should  be  arranged  so  that, 
as  far  as  possible,  numbers  to  be  compared  are  in  the  same  column. 

4.  Items  listed  in  a  table  should  usually  be  arranged  in  descending 
or  in  ascending  order  of  their  rank  in  the  trait  in  which  they  are 
being  compared. 

II.  Titles  and  Headings 

1. The titles should always go above a table since a table is
essentially a list.

2. Titles and headings should be so worded and the table so
arranged that the result will be a complete whole, independent of the
accompanying text.

3.  Table  titles  should  place  emphasis  upon  the  fact  or  facts  which 
the  table  is  intended  to  show.  This  can  be  accomplished  by  placing 
the  important  facts  at  the  beginning  of  the  title. 

4.  Words  like  "table  showing,"  "number  of,"  and  "distribution 
of"  should  be  omitted  wherever  the  meaning  of  the  title  is  clear 
without  them. 

III.  Punctuation 

1.  In  table  titles  use  all  capitals  or  capitals  and  small  capitals. 

2.  In  column  headings  and  in  line  titles,  capitalize  the  initial  letter 
of  each  important  word.  (In  printing,  capitals  and  small  capitals 
may  properly  be  used.) 

3. Do not end a title with a period. If the title consists of two
sentences, put a period after the first sentence.

4. Do not use periods in column headings or in line titles except
for abbreviations and to separate sentences as above. Avoid
abbreviations when possible.

5.  Do  not  use  periods  in  the  body  of  a  table  except  to  separate 
dollars  from  cents  or  units  from  tenths. 

6.  Where  one  line  of  items  is  to  be  compared  with  those  in  the 
rest  of  the  table  this  line  may  be  in  heavier  type,  so  that  it  may  be 
more  readily  seen. 

IV.  Symbols 

1.  Ditto  marks  should  not  be  used  either  in  the  body  of  a  table  or 
in  its  headings  and  titles. 

2.  Where  sums  of  money  are  stated  in  columns,  the  dollar  sign 
should  be  placed  before  the  first  item  in  the  list  and  before  the  total 
or  average. 

3.  Footnotes  to  the  table  should  be  indicated  by  letters  and  not 
by  figures  (also  good  form  to  use  symbols  such  as  *,  §,  etc.). 

4. Where data are not available do not fill in the space in the
table with 0's. Reserve 0 for the definite information that it gives,
that is, nothing; use . . . . or - - - - to show that no figures
are at hand.

5. A row of dots or dashes on the lower part of the line may be
used in the first column to guide the eye from each item to its
corresponding figure. These dots should not extend beyond the first
vertical rule.

V. "Total" and "Per Cent"

1.  "Total"  should  always  be  written  in  the  singular. 

2.  "Per  cent"  should  be  written  in  two  words,  with  no  period. 

VI.  Ruling 

1.  There  should  be  a  double  rule  at  the  top  of  the  table. 

2.  A  single  horizontal  rule  should  separate  column  headings  from 
the  body  of  the  table. 

3.  At  the  bottom  of  the  table  there  should  be  a  double  horizontal 
rule. 

4.  Totals  and  averages  should  be  separated  from  the  numbers  of 
which  they  are  the  aggregates,  by  single  heavy  rulings  (single  light 
ruling  is  also  good  form). 

5. There should be vertical rules between the line titles and the
figures, and between each two columns of figures.

6. Tables should not be closed in at the sides by vertical rules.

7.  Each  column  heading  should  be  boxed  in  except  at  the  two 
outer  sides. 

8. These rules may be summarized as follows: There are three
kinds of lines used in ruling a table: double lines at the top and
bottom of the table; single lines between column headings and figures,
and between columns; and heavy lines before totals and averages.

VII.  Spacing 

1.  In  long  tables  it  is  well  to  leave  a  double  space  after  each  five 
or  ten  lines  of  figures,  to  facilitate  the  reading. 

2.  Numbers  should  be  placed  in  the  middle  of  the  column  with 
corresponding  units  directly  under  each  other. 
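
Several of the ruling and punctuation notes can be illustrated by a short routine that prints a table in plain characters. This is a sketch under stated assumptions, not a prescribed form: the function `render_table` is hypothetical, "=" stands for a double rule, "-" for a single rule (used here before totals as well), " | " for a vertical rule, and figures are set flush right as an approximation to placing corresponding units under each other. The three rows are taken from Table 7.

```python
def render_table(title, headings, rows, footer=None):
    """Return a plain-text table ruled roughly as the style notes suggest."""
    body = [[str(cell) for cell in row] for row in rows]
    foot = [str(cell) for cell in footer] if footer else None
    every_row = [headings] + body + ([foot] if foot else [])
    widths = [max(len(r[i]) for r in every_row) for i in range(len(headings))]

    def ruled(row):
        # Line titles flush left, figures flush right; " | " stands in for
        # the vertical rules between columns, with none at the outer sides.
        cells = [row[0].ljust(widths[0])]
        cells += [c.rjust(w) for c, w in zip(row[1:], widths[1:])]
        return " | ".join(cells)

    width = sum(widths) + 3 * (len(widths) - 1)
    double, single = "=" * width, "-" * width
    # Title in capitals, above the table, with no final period.
    lines = [title.upper(), double, ruled(headings), single]
    lines += [ruled(r) for r in body]
    if foot:
        lines += [single, ruled(foot)]  # rule separating totals or averages
    lines.append(double)
    return "\n".join(lines)


example = render_table(
    "Expenditure per inhabitant for schools, three cities, 1914",
    ["City", "Population (Thousands)", "Per Inhabitant"],
    # Items in descending order of the trait compared; dollar sign on the first item only.
    [["Boston", 734, "$7.52"], ["Cleveland", 639, "5.59"], ["Baltimore", 580, "3.37"]],
)
print(example)
```

A `footer` row, if supplied, is set off from the figures above it by a single rule, after the manner of the note on totals and averages.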


3.  Column  and  Bar  Diagrams 

For  a  full  account  of  the  great  variety  of  diagrams  which  may 
be  used  the  reader  is  referred  to  such  texts  as  Williams's,  listed 
with  the  selected  texts  in  the  bibliography.  The  discussion  here 
will  be  confined  to  a  few  simple  types  which  serve  most  of  the 
purposes  in  an  ordinary  statistical  study  and  which  can  be  made 
without  much  training  or  great  outlay  of  drawing  materials. 
If  elaborate  figures  are  required  it  is  probably  better  to  have 
them  drawn  by  an  artist  from  a  rough  sketch  rather  than  spend 
time  in  acquiring  the  skill  necessary  to  use  a  drawing  board  and 
instruments. For the great majority of articles, books, and
theses, however, only the simplest types of diagrams are
necessary, and these may be made on ruled paper in black ink with
very little practice.

The column diagram consists of a series of columns
proportional in height to the quantities represented. A scale usually
appears at the left and a legend either on the background near
the columns or below as in Fig. 6. In this figure two varying
quantities are shown very effectively on the same chart, the
hatched portion representing the undesirable condition. Such
a diagram may be made with india ink on ruled graph paper,
blue lines being preferred, because these will be invisible if the
chart is photographed.

Fig. 7 is another ingenious variation of the column diagram.
Each block represents a school identified by number so that it
is possible to compare any school with another or with the whole
group. This type of diagram can be effectively used to represent
a group of test scores in such a way that each pupil can recognize
his score by number without revealing this fact to the rest of
the class.

In case the columns are used to represent the frequencies of the
various classes along the horizontal scale the resulting diagram is
known as a histogram. The columns are then proportional in height
and area to the frequencies, and this property makes the histogram
an excellent representation of a frequency distribution. The
histogram for the Otis scores in Table 4 is given in Fig. 8. It will
be noted that the horizontal scale is given in even integers and the
column moved slightly to the left so as to have the intervals
79.5-89.5 etc.
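
The counting that underlies a histogram may be sketched as follows. The fifty Otis scores are those listed in the exercises of Chapter II, which may or may not be the data of Table 4. The names `OTIS_SCORES` and `class_frequencies` are illustrative, and the rows of "#" marks are only a rough stand-in for drawn columns.

```python
# Fifty Otis scores from the table in the Chapter II exercises.
OTIS_SCORES = [
    171, 169, 128, 141, 106, 146, 87, 114, 187, 133, 151, 131, 150, 118, 142,
    166, 158, 101, 159, 126, 136, 137, 152, 137, 132, 133, 151, 145, 152, 157,
    144, 140, 111, 150, 152, 137, 146, 128, 145, 153, 149, 114, 135, 131, 161,
    95, 134, 124, 125, 167,
]


def class_frequencies(scores, lower, width):
    """Count the scores falling in consecutive classes of the given width,
    the first class beginning at `lower`. Returns {class start: frequency}."""
    counts = {}
    for s in scores:
        k = int((s - lower) // width)      # index of the class containing s
        start = lower + k * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Classes 79.5-89.5, 89.5-99.5, etc., as in the text.
freq = class_frequencies(OTIS_SCORES, lower=79.5, width=10)

for start, f in freq.items():
    print(f"{start:5.1f}-{start + 10:5.1f}  {'#' * f}  {f}")
```

Because the class boundaries fall at half units and the scores are whole numbers, no score can lie on a boundary, so each score belongs to exactly one class.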

Fig. 6. Showing the holding power of the schools
The columns represent the children enumerated by the school census
as of each age from six through twenty. Portion in outline represents
children in public schools. Portion in black represents those not in
public schools. (Cleveland Education Survey Report, 1916)

Fig. 7. Average scores made in spelling by ninety-six elementary schools *
The figures below the diagram show the percentages, and the ones in
the diagram show the number of the schools

An alternative representation of the frequency distribution is
given by the frequency polygon. This consists of lines connecting
the frequencies taken at the midpoints of the class intervals.
In Fig. 9 a histogram and frequency polygon are plotted
on the same background for comparison. It will be noted that
the area under the histogram between I.Q.'s from 90 to 100 is
exactly proportional to the observed frequency over that range.
The area under the polygon over this same range is somewhat in
defect, however, and similar discrepancies occur for the other
intervals unless three points on the polygon happen to be
on a line. For these reasons the histogram is a better
representation of the frequency distribution when a curve is
to be fitted to the data. In case a rough diagram is required
to show the overlapping of several distributions, the polygon is
probably better, but for most other representations the histogram
is preferable.
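
The comparison of areas can be made concrete with a little arithmetic. In this sketch the function names and the frequencies used (5, 11, 9 and 2, 4, 6) are hypothetical, chosen only for illustration. The polygon is assumed to join the points (midpoint, frequency) of adjacent classes of equal width.

```python
def histogram_area_over_class(f_mid, width=1.0):
    """Area of the histogram column standing on one class interval."""
    return f_mid * width


def polygon_area_over_class(f_prev, f_mid, f_next, width=1.0):
    """Area under the frequency polygon across the same class interval.

    At the lower boundary the polygon stands at the average of f_prev and
    f_mid, at the midpoint it stands at f_mid, and at the upper boundary
    at the average of f_mid and f_next."""
    left = (f_prev + f_mid) / 2
    right = (f_mid + f_next) / 2
    # Two trapezoids, each half a class interval wide.
    return (width / 2) * (left + f_mid) / 2 + (width / 2) * (f_mid + right) / 2


# A class of frequency 11 lying between classes of frequency 5 and 9.
print(histogram_area_over_class(11))       # area under the histogram
print(polygon_area_over_class(5, 11, 9))   # area under the polygon, smaller here

# Three frequencies that lie on one straight line.
print(histogram_area_over_class(4))
print(polygon_area_over_class(2, 4, 6))    # the two areas agree
```

The second function reduces to (f_prev + 6 f_mid + f_next) / 8 times the class width, which equals the histogram area only when f_prev + f_next = 2 f_mid, that is, when the three points are on a line.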
In bar diagrams the varying quantities are represented by
horizontal bars as in Fig. 10. The chief reason for preferring a
bar to a column arrangement is one of convenience. If the line
titles are large as in the accompanying figure, the use of columns
would be awkward. For a fairly large number of items the bar
diagram will also be found to be more effective.

Fig. 8. Histogram of Otis Test Scores (horizontal scale: Otis score)

Fig. 9. Histogram and frequency polygon for the intelligence
quotients of Table 20, Chapter VII

* From C. H. Judd, "Measuring the Work of the Public Schools," Cleveland
Education Survey Report, 1916, p. 84.

Quantities which exhibit variation in one dimension should be
represented by column or bar diagrams which are themselves
one-dimensional. The use of three-dimensional diagrams, such as a
row of persons of varying size to show increase in population, may
be very misleading because there is doubt as to whether the height,
area, or volume of the figures is proportional to the change in
population.

Fig. 10. Expenditure per child in average daily attendance for operation
and maintenance of public schools, for Cleveland and for seventeen other
cities *
(Horizontal scale: 0 to $60. Values shown against the eighteen bars, in
descending order: $64.78, 61.18, 58.97, 56.73, 52.96, 52.70, 52.40, 51.34,
51.32, 50.25, 46.59, 46.38, 45.08, 44.66, 43.17, 38.51, 33.07, 32.54)

* From Earl Clarke, "Financing the Public Schools," Cleveland Education
Survey Report, 1916, p. 37.


4. Coordinates

In  order  to  remind  the  reader  of  his  coordinate  geometry, 
which  will  be  very  much  needed  in  the  work  which  is  to  follow, 
the  next  few  paragraphs  will  be  devoted  to  a  summary  of  the 
elements  of  that  subject. 

If two straight lines OX and OY intersect in a plane, it is
possible to describe the location of any point P in the plane
with respect to the point of intersection O. For most
representations it is convenient to have the lines intersect at
right angles. The horizontal line OX is known as the x-axis, or
axis of abscissas, while the vertical line OY is called the y-axis,
or axis of ordinates. The distances of the point P from the two
axes are known as coordinates of the point. Thus in Fig. 11 the
abscissa of the point P is OM, or four units, while its ordinate
is ON, or three units. These two coordinates will locate uniquely
the position of any point P with respect to the origin O.

Fig. 11. Illustrating ordinate and abscissa

It will be noted that in Fig. 11 only positive quantities can be
represented. In case negative numbers occur, the coordinate
system may be extended as shown in Fig. 12. The plane is thus
divided into four quadrants numbered in counterclockwise
direction about O. The coordinates of a point in the second and
fourth quadrants are opposite in sign, while those for a point
in the third quadrant are both negative. The coordinates of
the four points in the diagram are as follows:

Point    Abscissa, X    Ordinate, Y
P1           +4             +2
P2           -5             +3
P3           -3             -2
P4           +7             -1

In plotting mathematical relationships it is usually necessary to
employ the more extended scheme with four quadrants, but in
dealing with statistical data which are usually positive, the first
quadrant will suffice.
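
The rule of signs for the four quadrants may be sketched as a small routine. The function `quadrant` and the dictionary `POINTS` are illustrative names. The coordinates are those of the four points tabulated above. Returning `None` for a point lying on an axis is a choice made for this sketch.

```python
def quadrant(x, y):
    """Return the quadrant (1 to 4, counted counterclockwise about the
    origin) in which the point (x, y) lies, or None if it is on an axis."""
    if x == 0 or y == 0:
        return None
    if x > 0 and y > 0:
        return 1
    if x < 0 and y > 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4  # x > 0 and y < 0


POINTS = {"P1": (4, 2), "P2": (-5, 3), "P3": (-3, -2), "P4": (7, -1)}

for name, (x, y) in POINTS.items():
    print(f"{name}: abscissa {x:+d}, ordinate {y:+d}, quadrant {quadrant(x, y)}")
```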

Fig. 12. Illustrating plotting in four quadrants

The following table gives the lung capacity in cubic inches of 521
boys in the laboratory schools of The University of Chicago. The
ages of the boys ranged from five to nineteen years, the
measurements being made within a few days of each birthday.

Table 8. Lung-Capacity Data from the Laboratory Schools

Age     Average Lung Capacity
  5              76
  6              73
  7              88
  8              95
  9             106
 10             122
 11             129
 12             148
 13             165
 14             184
 15             211
 16             230
 17             252
 18             264
 19             287

These data have been plotted in Fig. 13 and the points connected
in the form of a polygon. The trend appears fairly straight with
the exception of a general dip during the years of adolescence.
This dip has been verified by other material.*
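
One simple way of examining how nearly straight such a trend is, offered here only as a sketch and not as the method used in preparing Fig. 13, is to list the gain in average lung capacity from each age to the next. Along a perfectly straight trend these gains would all be equal. The names `AGES`, `CAPACITY`, and `gains` are illustrative, and the figures are those of Table 8.

```python
AGES = list(range(5, 20))
CAPACITY = [76, 73, 88, 95, 106, 122, 129, 148, 165, 184, 211, 230, 252, 264, 287]

# Gain in average lung capacity (cubic inches) from each age to the next.
gains = [later - earlier for earlier, later in zip(CAPACITY, CAPACITY[1:])]
average_gain = sum(gains) / len(gains)

for age, gain in zip(AGES, gains):
    print(f"age {age:>2} to {age + 1:>2}: {gain:+d}")
print(f"average yearly gain: {average_gain:.1f}")
```

The gains may then be compared with their average to see where the plotted points run above or below a straight course.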

Such a plot as that shown in Fig. 13 is of value in analyzing the
data and in giving to the reader a clear idea of the relation
between the variables involved. We shall turn next to the
general consideration of such functional relationships.


5.  Functional  Relationships 

When two variables are so related that the value of the first
variable depends upon the value of the second variable, then the
first variable is said to be a function of the second. The area of a
square, for example, is a function of the length of the side;
that is, area equals (side)^2, or y = x^2. Here the relationship is
exact, all true squares conforming precisely to the law. Such
functional relationships may be called mathematical, and are
generally written in the form y = f(x).

The second variable, to which values may be assigned at
pleasure, is called the independent variable, or argument; and the
first variable, whose values are determined as soon as values of
the argument are assigned, is called the dependent variable, or the
function. In the above example x, representing the side
of the square, is the independent variable, while y, representing
the area of the square, is the dependent variable.
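
The distinction between argument and function can be seen in a few lines. In this sketch, `area_of_square` is an illustrative name for the relation y = x^2, and the values assigned to the side are arbitrary.

```python
def area_of_square(side):
    """The dependent variable y (area) as a function of the argument x (side)."""
    return side ** 2


# Values of the argument are assigned at pleasure; the function follows.
for x in [1, 2, 3, 4, 5]:
    print(f"x = {x}   y = f(x) = {area_of_square(x)}")
```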

Other examples of functions are breathing capacity, which
is a function of the age of the person; the number of words
typed per minute, which is a function of the hours of practice;
and the score on an achievement test, which is a function of the
time spent in studying the subject tested. Such functions differ
from exact mathematical functions in that they depend upon
many more variables than the ones given, and the relationships
indicated are only approximate. Breathing capacity, for
instance, depends upon a great many factors other than age. A
curve or an equation expressing the most probable breathing
capacity for given ages will then furnish a basis for rough
estimation rather than exact prediction. An important part of
statistical method is concerned with the selection of those
mathematical functions which will give the best "fit" for a
given body of data (see Chapter XVI).

Fig. 13. A plot of lung-capacity data (vertical scale: lung capacity
in cubic inches; horizontal scale: age in years)

* Karl J. Holzinger, "On the Relation of Vital Capacity to Certain Psychical
Characters," Biometrika, Vol. XVI, p. 13[?].


6.  The  Straight  Line 

One of the simplest mathematical functions is that wherein
the change in y is directly proportional to the change in x,
for example, y = 3x or y = [fraction illegible]x. The graphs of these
functions will be straight lines through the origin, as shown in
Fig. 14. In obtaining the coordinates of various points it is only
necessary to substitute arbitrary values for the argument x, and
find the corresponding values of y. While only two points are
necessary to determine a straight line, one other value has been
given as a check.

Fig. 14. Graphs of the lines y = 3x and y = [fraction illegible]x

The general equation of a straight line may be written in the
form y = ax + b, where b is a constant representing the distance
from the origin to the point of intersection of the given line and
the y-axis (y-intercept), and a is a constant representing the
slope of the line (the tangent of the angle which the line makes
with the x-axis). The line y = 2x + 3 is shown in Fig. 15.
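
The procedure of substituting values of the argument, with a third point as a check, may be sketched for the line y = 2x + 3. The helper `line` and the particular values of x chosen are illustrative only.

```python
def line(a, b):
    """Return the function y = a*x + b with slope a and y-intercept b."""
    return lambda x: a * x + b


f = line(2, 3)  # the line y = 2x + 3

# Two points determine the line; the third serves as a check.
points = [(x, f(x)) for x in (0, 1, 4)]
(x1, y1), (x2, y2), (x3, y3) = points

slope = (y2 - y1) / (x2 - x1)              # slope found from the first two points
check_ok = (y3 - y1) == slope * (x3 - x1)  # the third point lies on the same line

print("points:", points)
print("slope:", slope, " y-intercept:", f(0))
print("third point on the line:", check_ok)
```

If the third point failed the check, an error in substitution would be indicated before any plotting was done.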
7. Non-Linear Relationships


The term "curve" is employed in mathematics to designate any line,
straight or curved, when located with reference to some
coordinate system. It has been noted that equations of the first
degree in x furnish straight lines when graphed. In case higher
powers of the argument are present, some other form of curve
results. One of the simplest of these is the parabola, the general
equation of which is y = ax^2 + bx + c, where the letters a, b, and
c again represent constants which determine the particular curve.
The parabola y = x^2 - 3x + 2 is shown in Fig. 16. Here positive
and negative values of the argument were substituted in the
equation of the parabola to find the corresponding values for y.

The normal probability curve, which will be used a great deal
in the subsequent work, may be written in the form y = e^{-x^2/2},
where e is the base of the Napierian system of logarithms and
is equal to 2.71828. The curve in Fig. 17 has been plotted from
the series of values furnished at the right. These values could
be calculated directly, but are readily obtained from tables
already prepared. It is evident that the same positive and
negative values of the argument give only one value for the function,
so that the curve is symmetrical about the y-axis. It is also to
be noted that the vertical scale unit was not taken equal to the
horizontal one. The choice of scale units will of course in no
way alter the properties of the curve and is largely a matter of
taste unless the curve is to be "fitted" to a series of
observations. (See Chapter XII, section 5.)

Fig. 17. Graph of y = e^{-x^2/2}
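
Values for plotting both curves may be obtained by direct substitution, as the following sketch shows. The function names are illustrative, the values of x are arbitrary, and the normal curve is assumed here to have the form y = e^(-x^2/2).

```python
import math


def parabola(x):
    """The parabola y = x^2 - 3x + 2."""
    return x ** 2 - 3 * x + 2


def normal_ordinate(x):
    """Ordinate of the curve, assuming the form y = e^(-x^2/2)."""
    return math.exp(-x ** 2 / 2)


print("  x    parabola")
for x in range(-2, 6):
    print(f"{x:>3}    {parabola(x):>5}")

print("  x     normal ordinate")
for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"{x:>4}    {normal_ordinate(x):.4f}")
```

Equal positive and negative values of x give the same ordinate in the second list, which is the symmetry about the y-axis noted in the text.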


EXERCISES 

1.  Calculate  the  valuation  per  inhabitant  from  the  following  data, 
computing  the  per  capita  valuations  to  the  nearest  dollar.  Make 
a  table  ruled  up  according  to  the  specifications  in  section  2.  The 
columns  in  the  table  will  be  (1)  city,  (2)  population,  (3)  total  valua- 
tion, (4)  valuation  per  inhabitant,  (5)  rank. 


City 

Population  in  1914 
(Thousands) 

Estimated  Valuation 

OF  All  Property 

Assessed 

(Thousands) 

Baltimore 

Boston 

Buffalo 

Cleveland 

Detroit 

Indianapolis 

580 
734 
454 
639 
538 
259 
294 
282 
439 
417 
343 
389 
361 
565 
449 
313 
735 
353 

$723,800 
1,489,609 
494,200 
756,831 
598,634 
363,414 

Jersey  City 

257,645 

Kansas  City      

371,191 

Los  Angeles 

Milwaukee 

836,604 
511,721 

Minneapolis 

Newark 

639,259 
383,864 

New  Orleans 

Pittsburgh 

San  Francisco 

314,086 

789,035 

1,247,391 

Seattle 

St.  Louis 

Washington 

473,175 

1,125,309 

538,390 
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2.  Make  a  column  diagram  for  the  following  data  on  centimeter 
or  similar  graph  paper : 

a.  Each  column  represents  a  grade  and  is  proportional  in  height 
to  the  membership  of  the  grade. 

h.  Darken  the  upper  part  of  each  column  to  show  the  proportion 
of  overage  children  in  each  grade. 

c.  Make  the  columns  one  centimeter  wide  and  leave  a  one-half- 
centimeter  space  between  columns. 

d.  Print  the  total  membership  over  each  column  or  use  a  scale  at 
the  left. 

e.  Put  a  suitable  title  at  the  bottom  of  the  diagram. 

Number of Normal and Overage Pupils in an Ideal School in
Which an 80 Per Cent Promotion Rate is in Effect

Grade      Total      Normal      Overage
I           125         120           5
II          125         112          13
III         125         103          22
IV          125          92          33
V           124          82          42
VI          121          73          48
VII         109          63          46
VIII         85          55          30

3.  Plot  the  following  pairs  of  scores  for  quality  and  speed  on  the 
Ayres  Handwriting  Scale. 

Q.   42,  31,  65,  59,  38,  62,  35,  47,  57,  67,  51,  42,  34,  29,  63 
S.    94,  91,  87,  81,  80,  78,  75,  74,  75,  73,  70,  68,  61,  43,  75 

4.  Make  histograms  for  the  distributions  found  in  Exercise  3  of 
Chapter  II. 

5.  Make  histograms  for  the  two  spelling  distributions  of  Exer- 
cises 6  and  7  of  Chapter  II. 

6.  Construct  graphs  for  the  cumulative  frequency  distributions 
given  by  Exercises  4  and  8  of  Chapter  II. 

7. Plot the straight lines,

(a) $y = 3x - 7$, (b) $y = 2x + 6$, (c) $x = 3y - 4$.

8.  Plot  the  curves,  -  x2 
(a)  i/  =  3x2  +  2x-l,    (h)  y  =  4x'-6x''  +  2x  +  S,    {c)ij  =  10e'^. 

(Make  use  of  the  values  given  in  section  7.) 


CHAPTER  IV 

LOGARITHMS 

1.  Introductory 

For  most  computations  it  is  best  to  use  a  calculating  machine, 
but  for  students  such  aids  are  frequently  out  of  the  question. 
They  must  often  resort  to  ordinary  arithmetic,  slide  rules,  or 
logarithms  in  working  out  statistical  problems.  In  dealing  with 
classroom  exercises  and  even  extended  problems  such  as  those 
arising  in  connection  with  a  thesis,  logarithms  will  be  found  to 
be  extremely  convenient  and  accurate.  The  present  chapter  is 
therefore  devoted  to  a  brief  account  of  their  nature  and  use. 

The  student  who  is  familiar  with  logarithms  may  omit  this 
chapter,  but  it  frequently  happens  that  one  needs  to  review  this 
subject.  The  present  material  may  then  serve  not  only  as  a 
short  introduction  for  the  student  who  knows  nothing  of  loga- 
rithms, but  also  as  a  convenient  reminder  of  some  of  the  things 
once  known  but  forgotten. 

2.  Arithmetical  and  Geometrical  Progressions 

An  arithmetical  progression  is  a  succession  of  terms  such  that 
each  term  differs  from  that  immediately  preceding  it  by  a  con- 
stant known  as  the  common  difference.  Show  that  the  following 
are  examples  of  such  arithmetical  progressions  or  series : 

Arithmetical Progression                         Difference, d

a. 1, 2, 3, 4, 5, 6, ...                              +1
b. 16, 14, 12, 10, 8, 6, ...                          -2
c. 6, 11, 16, 21, 26, 31, 36, ...                     +5
d. 2 1/2, 3 3/4, 5, 6 1/4, 7 1/2, ...                 +1 1/4
e. -5, -3, -1, +1, +3, +5, ...                        +2
f. a, a + d, a + 2d, a + 3d, ...                      +d
A geometrical progression is a series of terms such that each term is the product of the preceding term by a constant known as the ratio. Examples of such progressions are as follows:

Geometrical Progression                          Ratio, r

a. 1, 2, 4, 8, 16, 32, ...                            2
b. 100, -50, 25, -12.5, 6.25, ...                    -.5
c. 5, 5/3, 5/9, 5/27, 5/81, ...                       1/3
d. $a, ar, ar^2, ar^3, ar^4, \ldots$                  r

The  abbreviation  for  an  arithmetical  progression  is  A.  P. 
and  for  a  geometrical  progression  is  G.P. 

The arithmetical mean of a series is obtained by dividing the total of the numbers by the number of items in the series. Thus in the series 1, 2, 3, 4, 5, 6, 7, the mean is 28/7 = 4. A general procedure for finding the mean of any arithmetical series may be shown as follows: Let the first term and the difference be any algebraic numbers denoted by $a$ and $d$ and let $n$ be a positive integer representing the number of terms. We may then write

Number of term:   1     2         3          ...   n
Progression:      a     a + d     a + 2d     ...   a + (n - 1)d

The last term, or $l$, is clearly given by the formula

$$l = a + (n - 1)d. \tag{1}$$

If $s$ denotes the sum of the $n$ terms in such a progression, this sum written in natural and in reverse order will give

$$s = a + [a + d] + [a + 2d] + \cdots + [a + (n - 1)d]$$

and

$$s = l + [l - d] + [l - 2d] + \cdots + [l - (n - 1)d].$$

Adding these two equations, member by member, each of the $n$ pairs of corresponding terms gives $a + l$, so that $2s = n(a + l)$, and we find that

$$s = \frac{n(a + l)}{2}. \tag{2}$$

The arithmetical mean, A.M., is therefore given by

$$\text{A.M.} = \frac{s}{n} = \frac{a + l}{2}. \tag{3}$$



Applying this formula to the A.P. 6, 11, 16, 21, 26, 31, we obtain A.M. = $\frac{6 + 31}{2}$ = 18.5.
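The formulas for the last term, the sum, and the mean of an arithmetical progression can be checked numerically. The following is only a minimal sketch in Python; the function names (`ap_terms`, `ap_last`, `ap_sum`, `ap_mean`) are illustrative and do not come from the text.

```python
def ap_terms(a, d, n):
    """List the n terms a, a + d, ..., a + (n - 1)d."""
    return [a + k * d for k in range(n)]


def ap_last(a, d, n):
    """Last term: l = a + (n - 1)d."""
    return a + (n - 1) * d


def ap_sum(a, d, n):
    """Sum of n terms: s = n(a + l) / 2."""
    return n * (a + ap_last(a, d, n)) / 2


def ap_mean(a, d, n):
    """Arithmetical mean: (a + l) / 2."""
    return (a + ap_last(a, d, n)) / 2


# The A.P. used in the text: first term 6, difference 5, six terms.
terms = ap_terms(6, 5, 6)
print(terms)              # the six terms
print(ap_sum(6, 5, 6))    # agrees with sum(terms)
print(ap_mean(6, 5, 6))   # agrees with sum(terms) / len(terms)
```

The closed formulas agree with the direct total because every pair of terms taken from opposite ends of the progression has the same sum, $a + l$.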

If three numbers form a G.P., the middle number is called the geometrical mean of the other two and is obtained by extracting the square root of their product. This follows at once from the general form of a G.P.: $a, ar, ar^2, \ldots, ar^{n-1} = l$. For any two numbers $a$ and $b$, therefore, the geometrical mean is given by the formula

$$\text{G.M.} = \sqrt{ab}. \tag{4}$$

Example. The G.M. of 1 and 9 is $\sqrt{1 \times 9} = 3$, that is, 1, 3, 9 are in a G.P. with ratio 3.

Insert four geometric terms between 18 and $\frac{2}{27}$. The first term $a = 18$, $l = \frac{2}{27}$, and $n = 4 + 2 = 6$. Since $l = ar^{n-1}$ we have $\frac{l}{a} = r^{n-1} = r^{5} = \frac{1}{243}$, whence $r = \frac{1}{3}$. The required terms are therefore 6, 2, $\frac{2}{3}$, and $\frac{2}{9}$.
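The insertion of geometric terms between two given numbers can be sketched in a few lines of Python. This is an illustration under the assumption that both end terms are positive; the names `geometric_mean` and `insert_geometric_terms` are hypothetical, and floating-point arithmetic replaces the exact fractions of the text.

```python
import math


def geometric_mean(a, b):
    """G.M. of two positive numbers: the square root of their product."""
    return math.sqrt(a * b)


def insert_geometric_terms(a, l, k):
    """Insert k terms between a and l so that all k + 2 terms form a G.P."""
    n = k + 2                          # total number of terms
    r = (l / a) ** (1 / (n - 1))       # from l = a * r**(n - 1)
    return [a * r**i for i in range(1, n - 1)]


print(geometric_mean(1, 9))
# Four terms between 18 and 2/27; the text finds 6, 2, 2/3, and 2/9.
print(insert_geometric_terms(18, 2 / 27, 4))
```

Because the ratio is found by a fractional power, the printed terms may differ from 6, 2, 2/3, and 2/9 in the last decimal places.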

3.  The  Invention  of  Logarithms 

The  most  important  discovery  in  the  development  of  mathe- 
matical computation  was  the  invention  of  logarithms  by  John 
Napier,  Baron  of  Merchiston  of  Scotland  (1550-1617).  The 
principle  underlying  his  invention  may  be  explained  in  terms 
of  arithmetical  and  geometrical  progressions. 

Let  such  a  pair  of  associated  series  be  given  as  follows : 

A.P.    0    1    2    3    4     5     6     7      8      9      10
G.P.    1    2    4    8    16    32    64    128    256    512    1024

The product of any two numbers in the second line of numbers (G.P.) may be found by adding the corresponding numbers in the A.P., finding this sum in the A.P., and finally taking the corresponding number in the G.P. line as the required answer. Thus the product 4 × 128 may be found by adding 2 and 7 (the numbers in the A.P. corresponding), finding their sum (9) in the A.P., and then the corresponding number (512) in the G.P., this being the required product. The time-saving principle illustrated by this method is that the process of multiplication is replaced by that of addition.
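The table look-up just described can be imitated with two parallel lists. The sketch below is illustrative only; the names `AP`, `GP`, and `multiply_by_table` are assumptions made for the example, and the function works only for numbers that appear in the G.P. and whose product is still within the table.

```python
AP = list(range(11))          # 0, 1, 2, ..., 10
GP = [2**x for x in AP]       # 1, 2, 4, ..., 1024


def multiply_by_table(p, q):
    """Multiply two entries of the G.P. by adding their A.P. partners."""
    total = AP[GP.index(p)] + AP[GP.index(q)]   # addition replaces multiplication
    return GP[AP.index(total)]                  # read the answer back from the G.P.


print(multiply_by_table(4, 128))   # adds 2 and 7, then looks up 9
```

A product that falls beyond 1024 raises an error here, which mirrors the remark that such short series furnish only a few of the possible products.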

It  is  apparent  that  series  such  as  the  above  furnish  only  a  few 
of  the  possible  products  which  might  be  required.  The  system 
needs,  therefore,  to  be  extended.  In  addition  to  continuing  the 
progressions  at  either  end,  Napier  inserted  terms  as  illustrated 
by  the  following  series : 

A.P.    .5      1     1.5     2     2.5      3     3.5
G.P.    √2      2     √8      4     √32      8     √128
G.P.    1.41    2     2.83    4     5.66     8     11.31

This  amounts  to  inserting  arithmetical  and  geometrical  means 
between  the  original  terms. 

The above series are tabular representations of the function $y = 2^x$, where $x$ denotes the numbers in the A.P., and $y$ the numbers in the G.P. The number 2 is called the base and $x$ is said to be the logarithm of $y$ to the base 2. The logarithm of a number is thus the exponent to which a fixed number, called the base, must be raised to equal the given number, or, if $y = b^x$, then $x$ is the logarithm of $y$ to the base $b$, or $x = \log_b y$.

If 2 is the base, $\log_2 64 = 6$, because $2^6 = 64$; if 8 is the base, $\log_8 64 = 2$, because $8^2 = 64$. The number of possible bases is clearly infinitely large.
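The definition can be tried out directly. The following sketch uses Python's standard `math.log`; the helper name `log_base` is illustrative, and the results are floating-point approximations rather than exact values.

```python
import math


def log_base(y, b):
    """Exponent x for which b**x equals y, for positive y and base b != 1."""
    return math.log(y) / math.log(b)


print(log_base(64, 2))              # close to 6, since 2**6 = 64
print(log_base(64, 8))              # close to 2, since 8**2 = 64
print(log_base(math.sqrt(32), 2))   # close to 2.5, one of the inserted terms
```

The last line corresponds to the inserted term $\sqrt{32}$, whose partner in the A.P. is 2.5.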

The  invention  of  logarithms  by  Napier  stimulated  an  Eng- 
lishman by  the  name  of  Henry  Briggs  to  work  out  a  system  of 
logarithms  to  the  base  10.  Between  the  years  1617  and  1628, 
Briggs  and  others  completed  tables  of  logarithms  up  to  100,000 
carried  out  to  fourteen  decimal  places.  Many  other  tables  have 
since  been  computed,  one  of  the  most  complete  being  a  20-place 
table  carried  out  by  Mr.  A.  J.  Thompson*  under  the  direction 
of  Professor  Pearson. 

* A. J. Thompson, Logarithmetica Britannica, being a Standard Table of Logarithms to Twenty Decimal Places. Cambridge University Press, London, 1924.

The present chapter will deal entirely with the Briggs logarithms with a base of 10. Before considering their use, however, a brief review of the laws of exponents will be given.


4.  Laws  of  Exponents 

The symbol $a^n$ is used to represent the product of $n$ equal factors $a$, or $a^n = a \cdot a \cdot a \cdots$ to $n$ factors, where $n$ is a positive integral (whole) exponent. Certain fundamental laws for such exponents may now be given as follows:

I. $(a^m)(a^n) = a^{m+n}$; for example, $(10^2)(10^3) = 10^{2+3} = 10^5$. This follows at once from the fact that

$(a^m)(a^n) = (a \cdot a \cdot a \cdots \text{to } m \text{ factors})(a \cdot a \cdot a \cdots \text{to } n \text{ factors}) = a \cdot a \cdot a \cdots \text{to } (m + n) \text{ factors}.$

The remaining laws are proved in a similar way.

II. $\dfrac{a^m}{a^n} = a^{m-n}$; for example, $\dfrac{8^6}{8^4} = 8^{6-4} = 8^2$.

III. $(a^m)^n = a^{mn}$; for example, $(20^2)^3 = 20^6$.

IV. $(ab)^n = a^n b^n$; for example, $(3 \times 4)^2 = 3^2 \times 4^2$.

V. $\left(\dfrac{a}{b}\right)^n = \dfrac{a^n}{b^n}$; for example, $\left(\dfrac{2}{3}\right)^3 = \dfrac{2^3}{3^3}$.

The above laws also hold when the exponents are any positive or negative integral or fractional numbers. Fractional and negative exponents are defined as follows:

$a^{m/n} = \sqrt[n]{a^m}$; for example, $8^{2/3} = \sqrt[3]{8^2} = 4$.

$a^{-n} = \dfrac{1}{a^n}$; for example, $16^{-1/2} = \dfrac{1}{16^{1/2}} = \dfrac{1}{\sqrt{16}} = \dfrac{1}{4}$.

If $a^{-n} = \dfrac{1}{a^n}$, it follows that $a^{n} \cdot a^{-n} = a^{n-n} = a^{0} = 1$.

Thus any number (other than zero) to the zero power is equal to one. This is an important law and should be remembered.



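Some further illustrations of the above laws are as follows:

$16^{-1/4} = \dfrac{1}{\sqrt[4]{16}} = \dfrac{1}{2}.$

$(27^2)^{1/3} = \sqrt[3]{(27)^2} = 9.$

$\left(\dfrac{2}{3}\right)^{-1/2} = \left(\dfrac{3}{2}\right)^{1/2} = \sqrt{1.5} = 1.225.$

$5^{-2} = \dfrac{1}{5^2} = \dfrac{1}{25}.$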

5.  Laws  of  Logarithms 

From  the  definition  of  a  logarithm  and  the  laws  of  exponents, 
the  basic  principles  for  logarithmic  computation  may  be  ex- 
pressed as  follows : 

I. The logarithm of a product is equal to the sum of the logarithms of the factors, or

$$\log_b MN = \log_b M + \log_b N.$$

Proof. Let $x = \log_b M$ and $y = \log_b N$. Then $b^x = M$ and $b^y = N$ (from the definition in section 3) and $MN = b^{x+y}$, or $\log_b MN = x + y = \log_b M + \log_b N$. The proofs for the remaining laws are similar.

II. The logarithm of a quotient is equal to the logarithm of the dividend minus the logarithm of the divisor, or

$$\log_b \frac{M}{N} = \log_b M - \log_b N.$$

III. The logarithm of the nth power of a number is n times the logarithm of the number, or

$$\log_b M^n = n \log_b M.$$

IV. The logarithm of the nth root of a number is one-nth of the logarithm of the number, or

$$\log_b \sqrt[n]{M} = \frac{1}{n} \log_b M.$$
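The four laws may be verified numerically for any convenient values. The sketch below is a spot check, not a proof; the values of `M`, `N`, `n`, and `b` are arbitrary choices made for illustration, and each pair is compared only to within floating-point tolerance.

```python
import math


def log_b(x, b):
    """Logarithm of x to the base b."""
    return math.log(x) / math.log(b)


M, N, n, b = 43.2, 6.37, 3, 10   # arbitrary illustrative values

# Each entry pairs the left-hand side of a law with its right-hand side.
checks = {
    "I   product":  (log_b(M * N, b), log_b(M, b) + log_b(N, b)),
    "II  quotient": (log_b(M / N, b), log_b(M, b) - log_b(N, b)),
    "III power":    (log_b(M**n, b), n * log_b(M, b)),
    "IV  root":     (log_b(M ** (1 / n), b), log_b(M, b) / n),
}

for name, (lhs, rhs) in checks.items():
    print(name, round(lhs, 6), round(rhs, 6), math.isclose(lhs, rhs))
```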
6. The Briggs System of Logarithms

Returning to the Briggs system of logarithms we may note that the logarithm of any number $N$ to the base 10 is the exponent $x$ to which 10 must be raised to produce the number $N$. Thus, if

$$x = \log_{10} N,$$

then

$$10^x = N.$$

Inasmuch as 10 is always the base here considered we may hereafter write more briefly,

$$x = \log N.$$

From  the  above  definition  we  may  write  down  the  logarithms 
of  certain  numbers  at  once  as  shown  in  Table  9. 


Table 9. Showing the Logarithms of Numbers which are Multiples of 10

Number        Logarithm      Authority
100,000.          5          $10^{5}$ = 100,000.
 10,000.          4          $10^{4}$ = 10,000.
  1,000.          3          $10^{3}$ = 1,000.
    100.          2          $10^{2}$ = 100.
     10.          1          $10^{1}$ = 10.
      1.          0          $10^{0}$ = 1.
       .1        -1          $10^{-1}$ = .1
       .01       -2          $10^{-2}$ = .01
       .001      -3          $10^{-3}$ = .001
       .0001     -4          $10^{-4}$ = .0001
       .00001    -5          $10^{-5}$ = .00001

The logarithm of a number between 100 and 1000 will evidently be somewhere between 2 and 3, that is, some fractional exponent. The number having the logarithm 2.5, for example, may be found by taking the geometric mean of 100 and 1000, or $\sqrt{100{,}000} = 316.2$. We may then write log 316.2 = 2.5.

It  is  evident  that  logarithms  consist  of  an  integral  and  a 
decimal  part,  the  former  being  called  the  characteristic  and  the 
latter  the  mantissa  of  the  logarithm.  Thus  for  the  logarithm  of 
316.2  the  characteristic  is  2  and  the  mantissa  is  .5. 
Very complete tables of mantissas have been computed as described above and conveniently tabled for use. The characteristic, it will be noted, may always be obtained by inspection.

In  order  to  illustrate  the  procedure  in  finding  the  complete 
logarithm  a  four-place  table  of  mantissas  is  given  on  pages  60 
and  61.  Let  the  logarithm  of  43.2  be  required.  Since  this  num- 
ber lies  between  10  and  100  its  logarithm  will  be  between  1  and 
2  and  hence  the  characteristic  is  1. 

The  mantissa,  or  decimal  part,  is  found  by  looking  down  the 
column  under  N  for  the  figures  43  and  then  proceeding  to  the 
right  until  the  column  headed  2  is  reached.  The  number  found 
is  6355.  The  decimal  points  have  been  omitted  in  the  table,  so 
that  the  complete  logarithm  is  1  +  .6355,  or  log  43.2  =  1.6355. 

If the number had been 4.32, the characteristic would have been zero and the mantissa the same as before. Therefore, log 4.32 = 0.6355. This result is evident from the laws of exponents, for if

$10^{1.6355} = 43.2$,

then $10^{1.6355} \div 10 = 4.32$,

or $10^{0.6355} = 4.32$.

For  the  logarithm  of  .432,  the  characteristic  will  be  —  1  and 
the  mantissa  will  again  be  equal  to  +  .6355.  Instead  of  adding 
these  two  values  directly,  however,  it  is  found  more  convenient 
to  keep  the  mantissa  positive  and  write 

log  .432  =  9.6355  -  10, 

by  adding  and  subtracting  10  from  the  characteristic. 

The general rule for determining the characteristic of a logarithm may now be stated as follows: The characteristic of a number greater than 1 is one less than the number of digits to the left of the decimal point; while the characteristic for a number less than 1 is negative and one greater (numerically) than the number of zeros between the decimal point and the first significant figure.

In looking up the mantissa of a number the rule is to neglect the decimal point and find the nearest mantissa for the given sequence of digits. A more accurate method will be shown in section 7, where linear interpolation is presented.
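The separation of a logarithm into characteristic and positive mantissa, and the device of writing a negative characteristic as "minus 10," may be sketched as follows. This is an illustration only: the mantissa is computed with `math.log10` and rounded to four places rather than read from a table, and the function names are hypothetical.

```python
import math


def characteristic_and_mantissa(n):
    """Return (characteristic, four-place mantissa) of log10 n, mantissa >= 0."""
    lg = math.log10(n)
    c = math.floor(lg)            # integral part, negative for n < 1
    m = round(lg - c, 4)          # decimal part, always kept positive
    return c, m


def briggs_form(n):
    """Write log n in the text's style, e.g. '9.6355 - 10' for n = .432."""
    c, m = characteristic_and_mantissa(n)
    if c >= 0:
        return f"{c + m:.4f}"
    return f"{c + 10 + m:.4f} - 10"   # add and subtract 10


for number in (43.2, 4.32, 0.432, 0.00004, 0.999):
    print(number, briggs_form(number))
```

The form with 10 added and subtracted assumes a characteristic no smaller than -10, which covers the examples of this chapter.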

The logarithms of the following numbers should now be verified by these rules and Table 10.

log 6.37   = 0.8041           log .00004 = 5.6021 - 10
log .0637  = 8.8041 - 10      log 1910   = 3.2810
log .00637 = 7.8041 - 10      log 20000  = 4.3010
log 1.01   = 0.0043           log 2      = 0.3010
log .001   = 7.0000 - 10      log .999   = 9.9996 - 10

A  few  short  calculations  may  now  be  illustrated  by  the  use  of 
logarithms.  Let  the  product  6.37  x  1910  be  required.  By  the 
first  law  of  the  preceding  section, 

log  (6.37  X  1910)  =  log  6.37  +  log  1910 

=  0.8041  +  3.2810  =  4.0851. 

The number corresponding to the logarithm 4.0851 is clearly between 10,000 and 100,000, and the sequence of the digits is determined by the mantissa .0851. The nearest mantissa in Table 10 is .0864, corresponding to the number 122, so that the required product to three figures is 12,200. By direct multiplication the product is 12,166.70.

The  steps  in  the  above  calculation  were  as  follows : 

1.  Finding  the  logarithms  of  the  factors  (.8041  and  3.2810), 

2.  Adding  these  logarithms  (4.0851), 

3.  Looking  for  the  number  (N)  corresponding  to  the  mantissa 
of  the  sum  of  the  logarithms  (122  corresponds  to  .0864),  and 

4. Determining the number of places in the result by noting the characteristic of the sum of the logarithms (characteristic 4 gives five digits before decimal point), and supplying zeros for the missing digits. (Answer is 12,200.)
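The four steps can be imitated by a short program. The sketch below is only an illustration: instead of the printed Table 10 it builds a stand-in table of four-place mantissas with `math.log10`, it assumes positive inputs of at most three significant figures, and the names `MANTISSAS`, `four_place_log`, `antilog_nearest`, `multiply`, and `divide` are hypothetical.

```python
import math

# Stand-in for the table: four-place mantissas for the digit sequences 100-999.
MANTISSAS = {n: round(math.log10(n / 100), 4) for n in range(100, 1000)}


def four_place_log(x):
    """Step 1: characteristic by inspection, mantissa from the table."""
    c = math.floor(math.log10(x))
    digits = round(x / 10**c * 100)     # e.g. 6.37 -> 637, 1910 -> 191
    return c + MANTISSAS[digits]


def antilog_nearest(lg):
    """Steps 3 and 4: nearest tabled mantissa, then place the decimal point."""
    c = math.floor(lg)
    m = lg - c
    digits = min(MANTISSAS, key=lambda n: abs(MANTISSAS[n] - m))
    return digits * 10.0 ** (c - 2)


def multiply(x, y):
    """Step 2 for a product: add the logarithms."""
    return antilog_nearest(four_place_log(x) + four_place_log(y))


def divide(x, y):
    """Step 2 for a quotient: subtract the logarithms."""
    return antilog_nearest(four_place_log(x) - four_place_log(y))


print(multiply(6.37, 1910))   # three-figure answer; the exact product is 12,166.7
```

Taking the nearest mantissa limits the answer to three significant figures, which is why the result differs from the exact product in the fourth figure.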

Next let the quotient $\dfrac{.0437}{6920}$ be required. By the second law of logarithms, log quotient = log .0437 - log 6920, or (8.6405 - 10) - 3.8401 = 4.8004 - 10. The reason for adding and subtracting 10 for negative characteristics now becomes apparent, for the subtraction may be made continuous; that is, on reaching the decimal point, 1 may be borrowed from the 8, which is positive. The difference is therefore 4.8004 - 10. Looking in the table for the mantissa nearest .8004 we find .8007, which corresponds to the sequence of digits 632. The characteristic 4 - 10, or -6, shows that five zeros must follow between the decimal point and the first significant figure in the number. The required quotient is therefore .00000632. By arithmetical calculation we obtain .000006315+.

The great convenience of logarithms is shown especially in raising a number to a given power. If $(.642)^6$ be required, the third law of logarithms may be applied, and we find that

$\log (.642)^6 = 6 \log .642 = 6(9.8075 - 10)$
$= 58.8450 - 60$
$= 8.8450 - 10.$

The nearest mantissa is .8451 for N = 700, and the characteristic is -2. The answer is therefore .0700. By multiplying out (.642)(.642) ... to six factors we obtain .07002.

By applying the fourth law, $\sqrt[10]{.777}$ may be found as follows:

$\log \sqrt[10]{.777} = \frac{1}{10} \log .777 = \frac{1}{10}(9.8904 - 10)$
$= \frac{1}{10}(99.8904 - 100)$
$= 9.98904 - 10.$

The required root is therefore .975.
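The third and fourth laws translate directly into code. The following is a minimal sketch that uses the full precision of `math.log10` instead of a four-place table, so its results agree with the worked examples only to the figures carried there; the function names are illustrative.

```python
import math


def power_by_logs(x, n):
    """x**n for positive x, using log x**n = n log x."""
    return 10 ** (n * math.log10(x))


def root_by_logs(x, n):
    """nth root of positive x, using log of the root = (1/n) log x."""
    return 10 ** (math.log10(x) / n)


print(round(power_by_logs(0.642, 6), 5))   # compare with the text's .07002
print(round(root_by_logs(0.777, 10), 3))   # compare with the text's .975
```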

7.  Interpolation 

A graph of the logarithm function $y = \log_{10} N$ may be made by plotting a few of the values from Table 10. (See Fig. 18.)

The logarithm of 7 is given by the ordinate .8451, while the logarithm of 8 is represented by y = .9031. If the logarithms between 7 and 8 were unknown, an approximation to the logarithm of 7.5 could be obtained by assuming that the function is a straight line over this interval and taking the ordinate at 7.5 as the required logarithm. Graphically, this amounts to measuring the ordinate PQ shown in Fig. 18. Arithmetically, the procedure is to take half the sum of the logarithms of 7 and 8, or .8741. Reference to the table, however, gives log 7.5 = .8751, so that there is an error of .001 in this case.

The above method is known as linear interpolation and is extremely useful in case the interval over which the interpolation is carried is small. In such cases the function will be so nearly linear that only a slight error will result. Values of the function between those given in the table may be found, and hence a greater degree of accuracy may be obtained than in the tabled entries.

Thus, if the logarithm of 7.637 be required, the logarithms of 7.63 and 7.64 may be found in Table 10 and the extra amount for .007 found by interpolation. An enlarged portion of the graph is shown in Fig. 19. The difference between the logarithm for 7.64 and 7.63 is .0006 and is known as the tabular difference. From similar triangles it is now apparent that $\dfrac{c}{.0006} = \dfrac{.007}{.010}$, or c = .7(.0006) = .0004, where c is the correction to be added to the lower tabular value. The logarithm of 7.637 is therefore .8825 + .0004 = .8829. (From a seven-place table the logarithm is .8829228.)
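The similar-triangle argument amounts to a one-line formula. The sketch below applies it to the two cases worked in the text; the function name `interpolate` and its argument names are illustrative, and the tabular values are typed in by hand from Table 10.

```python
def interpolate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    fraction = (x - x0) / (x1 - x0)     # share of the interval covered
    return y0 + fraction * (y1 - y0)    # lower value plus the correction c


# log 7.5 from log 7 and log 8: a wide interval, so the error is noticeable.
print(round(interpolate(7.5, 7, 0.8451, 8, 0.9031), 4))

# log 7.637 from log 7.63 and log 7.64: tabular difference .0006.
print(round(interpolate(7.637, 7.63, 0.8825, 7.64, 0.8831), 4))
```

Over the narrow interval the interpolated value agrees with the seven-place logarithm to four places, whereas over the interval from 7 to 8 it falls short of the tabled value by .001.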

The labor of computing each correction is saved by using a table of proportional parts shown at the right of the main figures in Table 10. In finding the logarithm of 7.637, for example, it is only necessary to look up the logarithm of 7.63, move out along this line to the value 7 at the top or bottom of the proportional parts, and read off the entry 4. This last result is to be added to the fourth place of .8825, giving .8829 as before. In a similar way, the logarithms of 6.349 and .04233 are 0.8027 and 8.6266 - 10, respectively.

Fig. 18. Graph of y = log N illustrating linear interpolation for log 7.5

The  table  of  proportional  parts  is  also  useful  in  looking  up 
the  number  corresponding  to  a  given  logarithm.  This  may 
be  illustrated  by  the  following  problem : 

Find the product of .7437 and 3.242.

log .7437 =  9.8714 - 10
log 3.242 =  0.5108
log prod. = 10.3822 - 10

The nearest mantissa smaller than .3822 is .3820, which corresponds to the number 241. The difference .0002 is now found in the proportional parts on the same line and by moving up to the top is found to correspond to 1. This last result should be adjoined to the three figures already found, giving as the required number 2.411. The whole procedure will become clearer if the logarithm of 2.411 is now worked out as shown in the paragraph above.

Fig. 19. Showing linear interpolation for log 7.637

The  method  of  linear  interpolation  will  be  sufficiently  accu- 
rate for  small  differences  in  logarithms  and  similar  functions 
where  one  (and  possibly  two)  places  beyond  those  given  in  the 
table  are  required.  Thus  with  Table  10  linear  interpolation  is 
adequate  for  the  logarithms  of  four-place  numbers,  and  with 
a  five-place  table  such  as  Taylor's  *  similar  interpolation  gives 
logarithms  of  five-place  numbers. 

* Taylor, Five-Place Logarithmic and Trigonometric Tables. Ginn and Company. This table is especially recommended on account of its excellent physical make-up and the thumb index with which it is provided.

More exact methods of interpolation are often required in advanced statistical work, but the formulas become quite complicated and are used so seldom in elementary work that they are omitted here. For a clear account the student should consult Forsyth,* and for more advanced treatment an excellent work by Whittaker and Robinson.†

8.  Some  Additional  Problems 

It  should  be  noted  that  the  operations  of  addition  and  sub- 
traction of  numbers  cannot  be  carried  out  by  logarithms.  Thus, 
if  the  problem  to  be  worked  out  is 

$$\frac{(6.743)(89.24) - (36.5)}{475},$$

this must be broken up into two parts which are worked separately by logarithms and combined only when the final answers are obtained. The work will then be as follows:

log 6.743 = 0.8289           log 36.5  = 11.5623 - 10
log 89.24 = 1.9506           log 475   =  2.6767
log prod. = 2.7795           log quot. =  8.8856 - 10
log 475   = 2.6767           ∴ quot.   =   .07684
log quot. = 0.1028
∴ quot.   = 1.267

The  required  answer  is  therefore  1.267  —  .077  =  1.190.  As  we 
shall  see  in  the  next  chapter,  such  a  result  should  not  be  carried 
beyond  four  figures. 

In  subtracting  the  logarithm  of  475  from  log  36.5  it  will  be 
noted  that  10  has  been  added  to  and  subtracted  from  the  char- 
acteristic of  the  latter  in  order  to  facilitate  the  final  subtraction 
of  the  logarithms. 
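The division of the work into two parts can be mirrored in code. The sketch below is an illustration with full-precision logarithms rather than four-place tabular values; the variable names are hypothetical.

```python
import math

# First part: (6.743)(89.24) / 475, worked entirely by logarithms.
log_first = math.log10(6.743) + math.log10(89.24) - math.log10(475)
first = 10**log_first

# Second part: 36.5 / 475, also by logarithms.
log_second = math.log10(36.5) - math.log10(475)
second = 10**log_second

# The subtraction itself cannot be done by logarithms; it is ordinary arithmetic.
answer = first - second

print(round(first, 3), round(second, 5), round(answer, 3))
```

Only products, quotients, powers, and roots pass through the logarithms; the final difference is taken on the numbers themselves.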

A  typical  problem  that  occurs  in  statistical  calculation  is  of 
the  form        

I  o  1  r       foToQ  /  Q7  \2 

5. 


S.D.= 


%-'" 


h,    for  example, 


13483      /37 


794       V794 


* Forsyth, Mathematical Analysis of Statistics, chap. iii. Wiley, 1924.
† Whittaker and Robinson, The Calculus of Observations. D. Van Nostrand Company, 1924.
Table 10. Four-Place Logarithms of Numbers

                                                            Proportional Parts
N      0     1     2     3     4     5     6     7     8     9      1  2  3  4  5  6  7  8  9

10   0000  0043  0086  0128  0170  0212  0253  0294  0334  0374     4  8 12 17 21 25 29 33 37
11   0414  0453  0492  0531  0569  0607  0645  0682  0719  0755     4  8 11 15 19 23 26 30 34
12   0792  0828  0864  0899  0934  0969  1004  1038  1072  1106     3  7 10 14 17 21 24 28 31
13   1139  1173  1206  1239  1271  1303  1335  1367  1399  1430     3  6 10 13 16 19 23 26 29
14   1461  1492  1523  1553  1584  1614  1644  1673  1703  1732     3  6  9 12 15 18 21 24 27

15   1761  1790  1818  1847  1875  1903  1931  1959  1987  2014     3  6  8 11 14 17 20 22 25
16   2041  2068  2095  2122  2148  2175  2201  2227  2253  2279     3  5  8 11 13 16 18 21 24
17   2304  2330  2355  2380  2405  2430  2455  2480  2504  2529     2  5  7 10 12 15 17 20 22
18   2553  2577  2601  2625  2648  2672  2695  2718  2742  2765     2  5  7  9 12 14 16 19 21
19   2788  2810  2833  2856  2878  2900  2923  2945  2967  2989     2  4  7  9 11 13 16 18 20

20   3010  3032  3054  3075  3096  3118  3139  3160  3181  3201     2  4  6  8 11 13 15 17 19
21   3222  3243  3263  3284  3304  3324  3345  3365  3385  3404     2  4  6  8 10 12 14 16 18
22   3424  3444  3464  3483  3502  3522  3541  3560  3579  3598     2  4  6  8 10 12 14 15 17
23   3617  3636  3655  3674  3692  3711  3729  3747  3766  3784     2  4  6  7  9 11 13 15 17
24   3802  3820  3838  3856  3874  3892  3909  3927  3945  3962     2  4  5  7  9 11 12 14 16

25   3979  3997  4014  4031  4048  4065  4082  4099  4116  4133     2  3  5  7  9 10 12 14 15
26   4150  4166  4183  4200  4216  4232  4249  4265  4281  4298     2  3  5  7  8 10 11 13 15
27   4314  4330  4346  4362  4378  4393  4409  4425  4440  4456     2  3  5  6  8  9 11 13 14
28   4472  4487  4502  4518  4533  4548  4564  4579  4594  4609     2  3  5  6  8  9 11 12 14
29   4624  4639  4654  4669  4683  4698  4713  4728  4742  4757     1  3  4  6  7  9 10 12 13

30   4771  4786  4800  4814  4829  4843  4857  4871  4886  4900     1  3  4  6  7  9 10 11 13
31   4914  4928  4942  4955  4969  4983  4997  5011  5024  5038     1  3  4  6  7  8 10 11 12
32   5051  5065  5079  5092  5105  5119  5132  5145  5159  5172     1  3  4  5  7  8  9 11 12
33   5185  5198  5211  5224  5237  5250  5263  5276  5289  5302     1  3  4  5  6  8  9 10 12
34   5315  5328  5340  5353  5366  5378  5391  5403  5416  5428     1  3  4  5  6  8  9 10 11

35   5441  5453  5465  5478  5490  5502  5514  5527  5539  5551     1  2  4  5  6  7  9 10 11
36   5563  5575  5587  5599  5611  5623  5635  5647  5658  5670     1  2  4  5  6  7  8 10 11
37   5682  5694  5705  5717  5729  5740  5752  5763  5775  5786     1  2  3  5  6  7  8  9 10
38   5798  5809  5821  5832  5843  5855  5866  5877  5888  5899     1  2  3  5  6  7  8  9 10
39   5911  5922  5933  5944  5955  5966  5977  5988  5999  6010     1  2  3  4  5  7  8  9 10

40   6021  6031  6042  6053  6064  6075  6085  6096  6107  6117     1  2  3  4  5  6  8  9 10
41   6128  6138  6149  6160  6170  6180  6191  6201  6212  6222     1  2  3  4  5  6  7  8  9
42   6232  6243  6253  6263  6274  6284  6294  6304  6314  6325     1  2  3  4  5  6  7  8  9
43   6335  6345  6355  6365  6375  6385  6395  6405  6415  6425     1  2  3  4  5  6  7  8  9
44   6435  6444  6454  6464  6474  6484  6493  6503  6513  6522     1  2  3  4  5  6  7  8  9

45   6532  6542  6551  6561  6571  6580  6590  6599  6609  6618     1  2  3  4  5  6  7  8  9
46   6628  6637  6646  6656  6665  6675  6684  6693  6702  6712     1  2  3  4  5  6  7  7  8
47   6721  6730  6739  6749  6758  6767  6776  6785  6794  6803     1  2  3  4  5  5  6  7  8
48   6812  6821  6830  6839  6848  6857  6866  6875  6884  6893     1  2  3  4  4  5  6  7  8
49   6902  6911  6920  6928  6937  6946  6955  6964  6972  6981     1  2  3  4  4  5  6  7  8

50   6990  6998  7007  7016  7024  7033  7042  7050  7059  7067     1  2  3  3  4  5  6  7  8
51   7076  7084  7093  7101  7110  7118  7126  7135  7143  7152     1  2  3  3  4  5  6  7  8
52   7160  7168  7177  7185  7193  7202  7210  7218  7226  7235     1  2  2  3  4  5  6  7  7
53   7243  7251  7259  7267  7275  7284  7292  7300  7308  7316     1  2  2  3  4  5  6  6  7
54   7324  7332  7340  7348  7356  7364  7372  7380  7388  7396     1  2  2  3  4  5  6  6  7

The proportional parts are stated in full for every tenth at the right-hand side. The logarithm of any number of four significant figures can be read directly by adding the proportional part corresponding to the fourth figure to the tabular number corresponding to the first three figures.




Table 10. Four-Place Logarithms of Numbers (Continued)

N      0     1     2     3     4     5     6     7     8     9

55  7404  7412  7419  7427  7435  7443  7451  7459  7466  7474
56  7482  7490  7497  7505  7513  7520  7528  7536  7543  7551
57  7559  7566  7574  7582  7589  7597  7604  7612  7619  7627
58  7634  7642  7649  7657  7664  7672  7679  7686  7694  7701
59  7709  7716  7723  7731  7738  7745  7752  7760  7767  7774

60  7782  7789  7796  7803  7810  7818  7825  7832  7839  7846
61  7853  7860  7868  7875  7882  7889  7896  7903  7910  7917
62  7924  7931  7938  7945  7952  7959  7966  7973  7980  7987
63  7993  8000  8007  8014  8021  8028  8035  8041  8048  8055
64  8062  8069  8075  8082  8089  8096  8102  8109  8116  8122

65  8129  8136  8142  8149  8156  8162  8169  8176  8182  8189
66  8195  8202  8209  8215  8222  8228  8235  8241  8248  8254
67  8261  8267  8274  8280  8287  8293  8299  8306  8312  8319
68  8325  8331  8338  8344  8351  8357  8363  8370  8376  8382
69  8388  8395  8401  8407  8414  8420  8426  8432  8439  8445

70  8451  8457  8463  8470  8476  8482  8488  8494  8500  8506
71  8513  8519  8525  8531  8537  8543  8549  8555  8561  8567
72  8573  8579  8585  8591  8597  8603  8609  8615  8621  8627
73  8633  8639  8645  8651  8657  8663  8669  8675  8681  8686
74  8692  8698  8704  8710  8716  8722  8727  8733  8739  8745

75  8751  8756  8762  8768  8774  8779  8785  8791  8797  8802
76  8808  8814  8820  8825  8831  8837  8842  8848  8854  8859
77  8865  8871  8876  8882  8887  8893  8899  8904  8910  8915
78  8921  8927  8932  8938  8943  8949  8954  8960  8965  8971
79  8976  8982  8987  8993  8998  9004  9009  9015  9020  9025

80  9031  9036  9042  9047  9053  9058  9063  9069  9074  9079
81  9085  9090  9096  9101  9106  9112  9117  9122  9128  9133
82  9138  9143  9149  9154  9159  9165  9170  9175  9180  9186
83  9191  9196  9201  9206  9212  9217  9222  9227  9232  9238
84  9243  9248  9253  9258  9263  9269  9274  9279  9284  9289

85  9294  9299  9304  9309  9315  9320  9325  9330  9335  9340
86  9345  9350  9355  9360  9365  9370  9375  9380  9385  9390
87  9395  9400  9405  9410  9415  9420  9425  9430  9435  9440
88  9445  9450  9455  9460  9465  9469  9474  9479  9484  9489
89  9494  9499  9504  9509  9513  9518  9523  9528  9533  9538

90  9542  9547  9552  9557  9562  9566  9571  9576  9581  9586
91  9590  9595  9600  9605  9609  9614  9619  9624  9628  9633
92  9638  9643  9647  9652  9657  9661  9666  9671  9675  9680
93  9685  9689  9694  9699  9703  9708  9713  9717  9722  9727
94  9731  9736  9741  9745  9750  9754  9759  9763  9768  9773

95  9777  9782  9786  9791  9795  9800  9805  9809  9814  9818
96  9823  9827  9832  9836  9841  9845  9850  9854  9859  9863
97  9868  9872  9877  9881  9886  9890  9894  9899  9903  9908
98  9912  9917  9921  9926  9930  9934  9939  9943  9948  9952
99  9956  9961  9965  9969  9974  9978  9983  9987  9991  9996

N      0     1     2     3     4     5     6     7     8     9

With a machine or even by long-hand arithmetic, this result
may be worked out quite rapidly if a good table of squares is
used. The logarithmic calculation is a little awkward on account
of the subtraction under the radical, but inasmuch as
some students find it convenient the method may be illustrated
as follows:

log 3483 = 3.5420          log 37  = 11.5682 − 10
log 794  = 2.8998          log 794 =  2.8998
log S/N  = 0.6422          log c   =  8.6684 − 10
                           log c²  = 17.3368 − 20
S/N = 4.387                ∴ c² = .002

S/N − c² = 4.385

log (S/N − c²)  = 0.6420
log √(S/N − c²) = 0.3210
log h           =  .6990
log S.D.        = 1.0200          ∴ S.D. = 10.47
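The logarithmic route above can be checked against a direct evaluation. The following is a minimal sketch, assuming the formula S.D. = h√(S/N − c²) and the figures read off the worked example (S = 3483, N = 794, c = 37/794, h = 5, the last because log h = .6990); the function names are illustrative only.

```python
import math

def sd_direct(S, N, c, h):
    """S.D. = h * sqrt(S/N - c^2), using the symbols of the worked example."""
    return h * math.sqrt(S / N - c ** 2)

def sd_by_logs(S, N, c, h):
    """Same quantity by logarithms: halve the log of the radicand, add log h."""
    log_radicand = math.log10(S / N - c ** 2)
    log_sd = 0.5 * log_radicand + math.log10(h)
    return 10 ** log_sd

# Figures from the worked example.
direct = sd_direct(3483, 794, 37 / 794, 5)
by_logs = sd_by_logs(3483, 794, 37 / 794, 5)
print(round(direct, 2), round(by_logs, 2))
```

Both routes agree with the tabular result to the two decimals carried in the text; the subtraction under the radical still has to be done on the numbers themselves, not on their logarithms.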

Another calculation may be illustrated by the formula

r = Σxy / (N σx σy).     For example, r = 743.2 / [682 (2.673)(2.794)].

This is readily adapted to logarithmic work:

log 682   = 2.8338          log 743.2 = 12.8711 − 10
log 2.673 = 0.4270          log prod. =  3.7070
log 2.794 = 0.4462          log r     =  9.1641 − 10
log prod. = 3.7070          ∴ r = .1459

A final problem may be worked out in the case of the geometric
mean of several quantities where

G.M. = ⁿ√(X1 · X2 · X3 · · · Xn);

for example, G.M. = ⁵√(27.4 × 29.5 × 28.3 × 29.2 × 29.9).

We therefore have:

log 27.4 = 1.4378
log 29.5 = 1.4698
log 28.3 = 1.4518
log 29.2 = 1.4654
log 29.9 = 1.4757

log prod.     = 7.3005
1/5 log prod. = 1.4601
∴ G.M. = 28.8
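The geometric-mean computation can be sketched in a few lines: average the logarithms, then take the antilogarithm. This is only an illustration of the method just described; the function name `geometric_mean` is an assumption, not something taken from the text.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed as the antilog of the mean log."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

gm = geometric_mean([27.4, 29.5, 28.3, 29.2, 29.9])
print(round(gm, 1))
```

Working through the logs avoids forming the very large product directly, which is the same reason the method was convenient with printed tables.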


EXERCISES 

1.  Find  the  logarithms  of  the  following  numbers  by  a  four-place 
table.  Check  your  results  by  referring  to  a  five-place  table :  634.2, 
59.61,  1.722,  .004359,  .1166,  .00004795,  5566.,  6234000. 

2.  Work  out  the  following  operations  by  logarithms : 

^l437.1 


(1) 


(4) 


.6432  X  .03475 
6.742 


3472)2(.6745) 


(2) 


3622. 


(3) 


V472  X  347 


p 


-1^'      (5)V'67  X  68  X  69  X  70* 


(1.342)3 
Ans. (1) .003315, (2) .7030, (3) .00247, (4) .0001066, (5) 68.49.

3. Calculate the standard deviations with the following data,
using the formula S.D. = h√(S/N − c²).

          S        N        c         h        S.D. (Ans.)

(1)      4732     462      .0123     5         16.0
(2)      1692     192     1.1340     3          8.23
(3)      1573     641      .843      0.25       0.330

4. Compute the correlations for the data below, using the
formula r = a / √(bc).

           a         b         c        r (Ans.)

(1)      176       235       182         .851
(2)      234       234       259         .951
(3)      193       291       279         .677
(4)     −64.2      173.3    1892        −.112
(5)      831       831       831        1.000
5. Compute the geometric means for the following series:

a. 169, 171, 165, 168, 173, 175, 170. (170.1. Ans.)

b. 33.1, 34.2, 33.4, 34.5, 33.6, 34.7, 34.8, 33.9. (34.0. Ans.)

6. Calculate r12.3 by the following formula:

r12.3 = (r12 − r13 r23) / [√(1 − r13²) √(1 − r23²)]

Use Holzinger's * Table VII for log √(1 − r²).

        r12     r13     r23     r12.3 (Ans.)

(1)    .82     .16     .17     .815
(2)    .09     .16     .17     .065
(3)    .80     .80     .80     .444
(4)    .431    .327    .214    .391
(5)    .647    .832    .725    .115
(6)    .932    .327    .214    .934

7. Using the formula

1 − R1(234)² = (1 − r12²)(1 − r13.2²)(1 − r14.23²),

work out the values of R1(234) with the aid of Holzinger's Table VI.

        r12     r13.2    r14.23    R1(234) (Ans.)

(1)    .791    .620     .474      .906
(2)    .833    .695     .347      .928
(3)    .755    .062    −.007      .756
(4)    .815    .742     .676      .958

*  Karl  J.  Holzinger,  Statistical  Tables  for  Students  in  Education  and  Psychol- 
ogy.  The  University  of  Chicago  Press,  1925. 


CHAPTER  V 

ERRORS  IN  CALCULATION  AND  MEASUREMENT 

1.  Accuracy  in  Statistical  Method 

In  dealing  with  statistical  material  it  is  desirable  to  recognize 
very  early  the  importance  of  accuracy  not  only  in  the  calculations 
which  need  to  be  performed  but  in  the  data  themselves.  The 
student  should  train  himself  to  be  accurate  in  his  computations 
and  to  employ  adequate  checks  wherever  possible.  He  should 
also  be  cautious  as  to  accuracy  of  the  data  which  he  is  using, 
in  order  to  safeguard  against  making  unwarranted  conclusions 
from  the  results  obtained. 

Actual  blunders  in  calculation  can  best  be  obviated  by  ex- 
treme care  and  adequate  methods  of  checking  all  of  the  com- 
putations. Even  with  such  mistakes  eliminated,  however,  it  is 
necessary  to  be  cautious  regarding  the  number  of  places  to  use 
in  order  to  obtain  a  result  to  a  given  degree  of  accuracy.  The 
distinction  between  different  types  of  error  is  also  important. 
For  these  reasons  the  present  chapter  will  be  devoted  to  some 
of  the  simplest  principles  involved  in  errors  of  calculation  and 
measurement. 

2.  Absolute  and  Relative  Errors 

An error may be defined as the discrepancy between the obtained
and the true values from a numerical process or measurement.
If X1 be an obtained value and X the true value,
the difference, E1 = X1 − X, is known as the absolute error. The
ratio of the absolute error to the true value, or E1/X, is called
the relative error. For example, suppose the true value of X is
67.5 inches, and measurements X1 = 66.9 inches and X2 = 69.7
inches have been made. The two absolute errors will be
E1 = −0.6 and E2 = +2.2, while the corresponding relative
errors will be −.01 and +.03, or −1 per cent and +3 per cent.

Whenever  values  are  obtained  from  the  measurements  of  some 
continuous  variables  such  as  height,  they  can  never  be  exact  nor 
can  their  true  value  ever  be  determined.  All  such  values,  includ- 
ing the  errors  themselves,  must  be  approximations.  The  best 
that  can  be  done  is  to  measure  to  a  certain  degree  of  accuracy, 
take  the  average  of  a  number  of  observations  as  an  approxima- 
tion to  the  true  value,  and  consider  the  variations  from  this 
result  as  errors.  Thus  suppose  a  stick  is  measured  ten  times  to 
the  nearest  millimeter  and  the  following  observations  are  re- 
corded :  57,  58,  58,  56,  57,  60,  57,  55,  56,  56.  Their  average,  or 
57,  might  be  taken  as  the  true  or  most  typical  value,  and  the 
variations  0,  +1,  +1,  —  1,  0,  +3,  0,  —  2,  —  1,  —  1  would  be 
considered  as  absolute  errors  although  they  are  themselves  only 
approximations  to  the  true  errors. 
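The two definitions and the stick example can be reproduced with a short sketch. The helper names below are illustrative assumptions; the data are the figures given in this section.

```python
def absolute_error(obtained, true_value):
    """E = obtained value minus true value."""
    return obtained - true_value

def relative_error(obtained, true_value):
    """Absolute error expressed as a fraction of the true value."""
    return (obtained - true_value) / true_value

# Height example: true value 67.5 inches, measurements 66.9 and 69.7 inches.
for x in (66.9, 69.7):
    print(x, round(absolute_error(x, 67.5), 1), round(relative_error(x, 67.5), 2))

# Stick example: the average of ten readings stands in for the true value.
stick = [57, 58, 58, 56, 57, 60, 57, 55, 56, 56]
mean_length = sum(stick) / len(stick)
deviations = [x - mean_length for x in stick]
print(mean_length, deviations)
```

In the second example the "true" value is itself an estimate, so the deviations are only approximations to the true errors, as the text points out.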

In case we are dealing with a discrete series such as the numbers
of pupils in various school grades the resulting observations
of grade size may be considered as exact. It should be noted,
however,  that  the  unit  of  tabulation  in  such  a  series  is  the  pupil, 
and  that  these  units  are  equal  to  one  another  only  in  a  very 
limited  sense,  that  is,  as  human  entities. 

3.  Biased  and  Unbiased  Errors 

Errors  which  tend  to  compensate  or  offset  one  another  in  the 
long  run  are  known  as  unbiased  or  compensating  errors.  A  good 
example  is  furnished  by  the  rounding  off  of  numbers  to  a  smaller 
number  of  places  as  in  Table  11. 

Table 11. Average Daily School Attendance in the
United States, 1870-1910

Year         Children in A.D.A.      Thousands of Children in A.D.A.

1870              4,077,347                   4,077
1880              6,144,143                   6,144
1890              8,153,635                   8,154
1900             10,632,772                  10,633
1910             12,827,307                  12,827

Total            41,835,204                  41,835
Average           8,367,041                   8,367

In rounding off the numbers to the nearest thousand, figures
less than 500 are discarded and those greater than 500 are
considered as 1000. If figures had occurred at exactly 500, they
would have been equally divided above and below, or in case of a
single such number, 1000 would have been added. In Table 11
the "errors" in rounding were −347, −143, +365, +228,
and −307, with an algebraic total of −204. Even with so short
a series the total error was relatively small, and for a longer list
of numbers it would tend to become less because of the random
distribution of digits greater and less than five.

Unbiased errors are very important in the theory of averages
because their balancing effect will tend to make the average
more accurate than the original numbers. Thus, in Table 11, the
absolute error in the rounded average is only 41, which is much
less than that of any individual item. As a caution it should
be added that the above short series happened to illustrate
the above principles and was therefore chosen. In general a
longer series would be required to secure any considerable
balancing of errors.
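The compensating behaviour of the rounding errors in Table 11 can be verified with a brief sketch. It assumes simple half-up rounding to the nearest thousand (no item in the table ends in exactly 500, so the tie rule does not matter here); the names are illustrative.

```python
attendance = {
    1870: 4_077_347,
    1880: 6_144_143,
    1890: 8_153_635,
    1900: 10_632_772,
    1910: 12_827_307,
}

def round_to_thousands(n):
    """Nearest thousand, expressed in thousands (half-up, an assumption)."""
    return (n + 500) // 1000

# Error of each rounded item, in single children.
errors = {year: round_to_thousands(n) * 1000 - n for year, n in attendance.items()}
total_error = sum(errors.values())

exact_average = sum(attendance.values()) / len(attendance)
rounded_average = 1000 * sum(round_to_thousands(n) for n in attendance.values()) / len(attendance)
average_error = rounded_average - exact_average

print(errors, total_error, round(average_error))
```

The individual errors run to several hundred, while the error in the average is about a tenth of that, which is the balancing effect described above.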

Biased errors are those which do not tend to neutralize
one another but accumulate in such a way as to produce a
relatively large error in the total or average. Such errors are
illustrated in Fig. 20 by the marks of two teachers.

[Fig. 20. Distribution of grades of a class of pupils by a regular
teacher and by a substitute. Two panels, "By regular teacher" and
"By substitute teacher," each with the grades E, D, C, B, A along the
base. Based on one month's class work and five tests.]
If biased errors are present in a series of observations, the
average will tend to be as inaccurate as the individual measurements
upon which it is based. Suppose that a meter stick is one
centimeter  shorter  than  the  standard.  All  measurements  with 
it  will  have  a  relative  error  of  1  per  cent  in  the  same  direction 
and  the  average  will  be  likewise  affected  as  illustrated  in  the 
following  table : 


Table 12. Hypothetical Measurements with Constant Error
of 1 Per Cent

             Observed Measurement      Constant Error

                      69                   + .69
                      71                   + .71
                      69                   + .69
                      68                   + .68
                      73                   + .73

Total    . . . .     350                   + 3.50
Average  . . .        70                   + .70
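A short sketch of Table 12 shows why a constant error does not average out. It assumes, as the table does, an error of exactly 1 per cent of each observed measurement; the variable names are illustrative.

```python
observed = [69, 71, 69, 68, 73]

# A constant 1 per cent error, always in the same direction.
constant_errors = [round(0.01 * x, 2) for x in observed]

average_observed = sum(observed) / len(observed)
average_error = sum(constant_errors) / len(constant_errors)
relative_error_of_average = average_error / average_observed

print(round(sum(constant_errors), 2), round(average_error, 2),
      round(relative_error_of_average, 2))
```

The average carries the same 1 per cent relative error as each item, in contrast to the rounding errors of Table 11, which largely cancelled.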

4.  Significant  Figures 

The  digits  in  a  numerical  result  which  are  known  to  be  correct 
are  called  significant  figures.  Thus,  if  a  measurement  such  as 
39.6  mm.  be  made,  it  is  assumed  to  be  correct  to  the  nearest 
tenth  of  a  millimeter  and  is  said  to  have  three  significant  figures, 
the  true  value  lying  anywhere  between  39.55  mm.  and  39.65  mm. 
If  the  same  result  is  expressed  as  .0396  meter,  it  is  still  to  be  con- 
sidered as  correct  to  three  figures,  the  zero  after  the  decimal 
point  merely  serving  to  fill  a  space.  When  zeros  occur  on  the 
right  of  a  series  of  digits  the  significant  figures  may  be  shown  by 
the  use  of  a  decimal  point.  For  example,  a  measurement  such 
as  2600.  is  correct  to  four  figures  or  is  between  2599.5  and  2600.5, 
while  2600  is  correct  to  only  two  figures  and  lies  between  2550 
and  2650.  By  way  of  further  illustration,  the  following  num- 
bers would  all  be  considered  as  correct  to  five  significant  figures : 
47.234,  .00036924,  .0042000,  4349.0,  1000.0,  956340,  1.0000. 
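The conventions of this section can be sketched in code. The two helpers below are illustrative assumptions: `round_sig` rounds a number to n significant figures, and `implied_interval` returns the range within which the true value is taken to lie, judged from the decimal places written. The second helper cannot express the convention that 2600 without a decimal point has only two significant figures.

```python
import math
from decimal import Decimal

def round_sig(x, n):
    """Round x to n significant figures."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))  # position of the leading digit
    return round(x, n - 1 - exponent)

def implied_interval(text):
    """Half a unit in the last written place on either side of the value."""
    value = Decimal(text)
    half_unit = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_unit, value + half_unit

print(round_sig(35.675002, 4), round_sig(0.003674378, 4))
print(implied_interval("39.6"))
```

Passing the measurement as a string keeps any trailing zeros, which a floating-point number would silently drop.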




5.  Arithmetical  Computation  with  Rounded  Numbers 

Consider the following series of products with successively
rounded values of π = 3.1415927 and e = 2.7182818, whose
product, correct to eight significant figures, is 8.5397342.

π × e                          Product                Correct Value

(3.1415927)(2.7182818)      = 8.53973425942286        8.5397342
(3.141593)(2.718282)        = 8.539735703226          8.539734
(3.14159)(2.71828)          = 8.5397212652            8.53973
(3.1416)(2.7183)            = 8.53981128              8.5397
(3.142)(2.718)              = 8.539956                8.540
(3.14)(2.72)                = 8.5408                  8.54
(3.1)(2.7)                  = 8.37                    8.5
(3)(3)                      = 9.                      9.

The  bold-faced  figures  are  those  which  agree  with  the  correct 
values  on  the  right  when  the  remaining  digits  are  consolidated. 
Thus  the  first  product  is  correct  to  seven  significant  figures  only, 
for  if  rounded  one  place  further  to  the  right  there  would  have 
been  an  error  of  1  in  the  seventh  decimal  place,  that  is, 
8.5397343  instead  of  8.5397342.  Of  the  remaining  products 
only  three  are  correct  to  as  many  significant  figures  as  occur  in 
each  factor,  while  three  others  are  correct  to  one  less  figure. 
The  table  illustrates  the  rule  that  it  is  not  safe  to  carry  out  the 
product  of  two  such  factors  beyond  the  number  of  significant 
figures  included  in  each. 

The  same  principle  may  be  illustrated  in  another  way.  Sup- 
pose that  the  product  of  36.9  by  8.74  is  required,  both  factors 
being  correct  to  three  significant  figures.  The  obtained  product 
is  322.506.  The  maximum  product  is  36.95  X  8.745,  or  323.12775, 
while  the  minimum  product  is  36.85  X  8.735,  or  321.88475.  In 
this  problem  it  is  therefore  doubtful  whether  the  correct  answer 
is  322  or  323.  To  give  the  result  to  two  significant  figures,  as 
320,  would  not  be  desirable,  for  both  maximum  and  minimum 
products  exceed  the  value.  The  answer  323  is  to  be  preferred 
because  it  is  nearer  the  average  of  the  extreme  products  and 
therefore  more  probably  correct  than  322. 
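The maximum and minimum products can be computed mechanically. The sketch below assumes every factor is positive and is written as a string, so that its last decimal place (and hence its half-unit of uncertainty) can be read from the way it is written; `half_unit` and `product_bounds` are illustrative names.

```python
from decimal import Decimal
from math import prod  # available in Python 3.8 and later

def half_unit(text):
    """Half a unit in the last written decimal place of a number."""
    return Decimal(1).scaleb(Decimal(text).as_tuple().exponent) / 2

def product_bounds(*factors):
    """Smallest and largest products consistent with the rounded factors."""
    low = prod(Decimal(f) - half_unit(f) for f in factors)
    high = prod(Decimal(f) + half_unit(f) for f in factors)
    return low, high

low, high = product_bounds("36.9", "8.74")
print(low, high)  # the obtained product 322.506 lies between these

# The same helper handles more than two factors.
low3, high3 = product_bounds("34", ".129", "6.784")
print(low3, high3)
```

Comparing the two bounds digit by digit shows at once how many figures of the product can safely be kept.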
When several factors are involved the rounding errors will
tend to offset one another, but in case they do not, the error in
the product may be relatively great, as illustrated by the
following example:

Rounded value      34    × .129   × 6.784   = 29.754624 = 30
Maximum value      34.5  × .1295  × 6.7845  = 30.311450 = 30
Minimum value      33.5  × .1285  × 6.7835  = 29.201272 = 29

The  best  answer  for  this  particular  problem  is  probably  30, 
but  if  a  dozen  more  items  were  included  the  product  might  be 
given  to  one  more  significant  figure  than  in  the  item  with  the 
least  significant  figures. 

In the case of division, similar reasoning may be applied.
Consider, for example, the quotient 8.47 ÷ 23 = .368. The
maximum and minimum values are 8.475 ÷ 22.5 = .377 and
8.465 ÷ 23.5 = .360, respectively. Here there is such wide
variation in the third figure that it could not be safely retained in
the answer. The quotient .37 is probably best as an average
between the extreme values .38 and .36.
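For a quotient the extremes pair the largest numerator with the smallest denominator, and the reverse. A minimal sketch, assuming positive quantities and with the half-units of uncertainty passed in explicitly (the function name is illustrative):

```python
def quotient_bounds(numerator, num_half_unit, denominator, den_half_unit):
    """Smallest and largest quotients consistent with the rounded inputs."""
    largest = (numerator + num_half_unit) / (denominator - den_half_unit)
    smallest = (numerator - num_half_unit) / (denominator + den_half_unit)
    return smallest, largest

# 8.47 is correct to the nearest .01, 23 to the nearest unit.
smallest, largest = quotient_bounds(8.47, 0.005, 23, 0.5)
print(round(smallest, 3), round(largest, 3))
```

Almost all of the spread comes from the two-figure divisor, which is why the quotient deserves no more than two figures.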

The general rule, then, is that a product or a quotient with
rounded numbers should not be written to more significant figures
than occur in the item with the least significant figures.

For  addition  and  subtraction  it  is  necessary  to  consider  the 
accuracy  of  the  items  rather  than  the  number  of  significant 
figures.  Consider  the  measurements  624.2  feet  and  49.17  feet. 
The  maximum  error  in  the  former  is  .05  feet  and  in  the  latter 
.005  feet.  The  maximum  and  minimum  sums  are  therefore 
624.25  +  49.175  =  673.425  feet  and  624.15  +  49.165  =  673.315 
feet.  The  sum  is  thus  probably  correct  to  only  one  decimal 
place  and  is  obtained  by  rounding  49.17  to  49.2  and  writing 
624.2  +  49.2  =  673.4  feet. 

When  several  items  are  added  it  is  probably  best  to  round 
them  at  once  to  the  number  of  decimal  places  in  the  least  accu- 
rate measurement.  If  compensating  errors  occur,  they  will  off- 
set one  another  in  the  rounding  of  the  items  just  as  well  as  if 
more  places  had  been  retained  and  the  sum  rounded. 
Hypothetical Problem Illustrating Rounding in Sums

           Original Items     Rounded to Two      Rounded to One
                              Decimal Places      Decimal Place

               67.432             67.43               67.4
                9.64               9.64                9.6
               10.4               10.4                10.4
                8.356              8.36                8.4
               17.9               17.9                17.9
                6.566              6.57                6.6
                8.327              8.33                8.3
                7.463              7.46                7.5
               29.638             29.64               29.6
               19.784             19.78               19.8

Total         185.506            185.51              185.5

It  is  readily  verified  that  the  maximum  and  minimum  sums 
are  185.6145  and  185.3975.  The  answer  185.5  is  therefore  the 
best,  and  it  may  be  obtained  as  well  from  the  last  column  of 
figures  as  from  the  second  where  the  items  have  been  carried  to 
two  decimal  places  and  the  sum  rounded  to  one. 
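The maximum and minimum sums quoted above can be verified as follows. This is a sketch under the assumption that each item is uncertain by half a unit in its last written place; the items are kept as strings so that the number of decimal places is not lost.

```python
from decimal import Decimal

items = ["67.432", "9.64", "10.4", "8.356", "17.9",
         "6.566", "8.327", "7.463", "29.638", "19.784"]

def half_unit(text):
    """Half a unit in the last written decimal place of a number."""
    return Decimal(1).scaleb(Decimal(text).as_tuple().exponent) / 2

total = sum(Decimal(t) for t in items)
max_error = sum(half_unit(t) for t in items)
minimum_sum, maximum_sum = total - max_error, total + max_error

# Rounding every item to one decimal place first, then adding.
rounded_sum = sum(Decimal(t).quantize(Decimal("0.1")) for t in items)

print(total, minimum_sum, maximum_sum, rounded_sum)
```

Nearly all of the possible error comes from the two items given to one decimal place, so nothing is gained by carrying the others further.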

In the case of a square root like √(4986.1 ÷ 827), the division
under the radical should be carried to five significant figures if
only the numerator is subject to error and the denominator is
exact (see Standard Deviation). This gives √6.0291 = 2.46.


6.  Logarithmic  Computation  with  Rounded  Numbers 

Inasmuch as a good share of the students' calculations may
be performed with the aid of logarithms it may be well to discuss
briefly their use with rounded numbers. As an illustration
let the product 3.47 × 8.96 be required. The maximum and
minimum factors f1 and f2, their logarithms, and the resulting
products may be set down as follows:

               f1       f2       Log f1      Log f2      Log f1f2     f1f2

Maximum  .    3.475    8.965    .5409548    .9525503    1.4935051    31.153
Actual . .    3.47     8.96     .5403295    .9523080    1.4926375    31.091
Minimum  .    3.465    8.955    .5397032    .9520656    1.4917688    31.029
The best or most probable answer is 31.1, which is the product
of 3.47 and 8.96 carried to three significant figures, and in order
to obtain it four-place logarithms are as satisfactory as the
seven-place. The abbreviated computation would then be

log 3.47  =  .5403
log 8.96  =  .9523

log prod. = 1.4926
∴ prod.   = 31.1.
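The claim that four-place logarithms serve as well as seven-place ones for three-figure factors can be tried out directly. In this sketch the printed tables are imitated by rounding computed logarithms to the stated number of places; `product_by_logs` is an illustrative name.

```python
import math

def product_by_logs(a, b, places):
    """Multiply by adding logarithms rounded to `places` decimals."""
    log_sum = round(math.log10(a), places) + round(math.log10(b), places)
    return 10 ** log_sum

four_place = product_by_logs(3.47, 8.96, 4)
seven_place = product_by_logs(3.47, 8.96, 7)
print(round(four_place, 3), round(seven_place, 3))
```

The two results differ only beyond the third significant figure, which the rounded factors could not support in any case.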


When four significant figures are involved, four-place logarithms
may be employed, but a five-place table is much more
convenient because no interpolation is necessary if the entries
for N are given to four places. For example, the computation
of the product 123.7 by 96.45 may be done in either of the
following ways:

With a Four-Place Table          With a Five-Place Table
and Interpolation                (No Interpolation)

log 123.7  = 2.0923              log 123.7  = 2.09237
log  96.45 = 1.9843              log  96.45 = 1.98430
log prod.  = 4.0766              log prod.  = 4.07667
∴ prod.    = 11,930.             ∴ prod.    = 11,930.


With  a  product  such  as  34.79  by  7643.29,  the  second  factor 
should  be  consolidated  to  7643  or  7643.3  and  a  five-place  table 
of  logarithms  employed. 

The general rule that will apply also in the case of division is
that when n is the least number of figures to which any of the items
is correct, an n or at most an n + 1 place logarithm table should
be used.

In logarithmic calculation involving formulas the same general
rule may be followed. Thus in the case of the functions 1 − r²
and √(1 − r²), which occur very frequently, three-place,
four-place, or five-place logarithm tables will be ample when the
values of r are given to two, three, and four places, respectively.
The following calculation illustrates the variations which may
occur in the numbers and logarithms:
          r                       1 − r²                   Log (1 − r²) *

Value    Min.    Error     Value    Max.     Error     Value     Max.      Error

.18      .175    .005      .9676    .9694    .0018     9.9857    9.9865    .0008
.50      .495    .005      .7500    .7550    .0050     9.8751    9.8779    .0028
.82      .815    .005      .3276    .3358    .0082     9.5153    9.5261    .0108
.99      .985    .005      .0199    .0298    .0099     8.2989    8.4739    .1750

It will be noted that when the value of r is given correct to two
places, the logarithm of (1 − r²) may have an error in the first,
second, or third place, etc., depending upon the size of r.
Three-place logarithms of (1 − r²) would therefore be sufficient for
such problems.
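The growth of the logarithmic error with r can be reproduced in a short sketch. It assumes r is rounded to two places, so that the smallest admissible r is .005 below the stated value; the function name is illustrative. Values computed this way may differ by a unit in the last place from the table, which works from logarithms already rounded to four places.

```python
import math

def errors_from_rounding_r(r, half_unit=0.005):
    """Maximum error in 1 - r^2 and in its logarithm when r may be low by half_unit."""
    r_min = r - half_unit
    value = 1 - r ** 2
    value_max = 1 - r_min ** 2
    return value_max - value, math.log10(value_max) - math.log10(value)

for r in (0.18, 0.50, 0.82, 0.99):
    value_error, log_error = errors_from_rounding_r(r)
    print(f"r = {r:.2f}   error in 1 - r^2 = {value_error:.4f}   error in log = {log_error:.4f}")
```

For r near 1 the quantity 1 − r² is small, so a modest absolute error in it becomes a large error in its logarithm.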

In a similar way it may be shown that while a product such
as [1 − (.856)²][1 − (.943)²] may have a rounding error of only
.00035 there may be an error of .005 in its logarithm, due to
an addition of the errors in the two factors, as shown below.

[1 − (.8565)²][1 − (.9435)²] = .02925,
[1 − (.856)²][1 − (.943)²]   = .02960,
[1 − (.8555)²][1 − (.9425)²] = .02995.

Maximum rounding error = .00035.

log [1 − (.856)²] = 9.42694 − 10      log [1 − (.8555)²] = 9.42833 − 10
log [1 − (.943)²] = 9.04435 − 10      log [1 − (.9425)²] = 9.04803 − 10
log prod.         = 8.47129 − 10      log prod.          = 8.47636 − 10

Maximum error in log prod. = .00507.

The  product  to  be  chosen  is  surely  right  when  written  to  one 
significant  figure,  as  .03,  but  it  might  be  correct  to  three  sig- 
nificant figures,  as  .0296,  on  account  of  compensating  errors. 

In case it is desired to have the final answer for a problem
correct to n significant figures, it is usually best to begin with
the items correct to n + 1 significant figures and use n + 2 place
tables in the computation.


*  Karl  J.  Holzinger,  Statistical  Tables  for  Students  in  Education  and  Psychology. 
The  University  of  Chicago  Press,  1925.   See  Table  VI  for  log  (1  -  r2). 
7. Errors in Educational Measurement

The  errors  discussed  thus  far  have  been  due  chiefly  to  the 
rounding  of  approximate  measurements.  They  are  not  peculiar 
to  any  one  field,  but  occur  whenever  measurements  or  observa- 
tions are  made  and  should  be  taken  into  account  in  the  sub- 
sequent calculations.  Being  unbiased  in  character  their  effect 
upon  the  final  result  may  be  controlled  by  care  in  the  arith- 
metical operations  as  described  above.  The  present  section  will 
be  concerned  with  errors  which  occur  in  the  measurement  of 
mental  characters. 

One  difference  between  mental  and  physical  measurements 
arises  from  the  nature  of  the  scales  employed.  Arithmetical  abil- 
ity, for  example,  is  a  very  complex  character  and  its  resolution 
into  component  abilities  such  as  those  of  addition  and  multipli- 
cation is  at  best  a  matter  of  convenience  because  each  of  these 
is  a  combination  of  still  more  specific  abilities.  A  unit  of  such 
arithmetical  ability  can  therefore  never  be  quite  the  equivalent 
of  another  unit  in  the  arithmetical  scale  in  the  same  way  that 
an  inch  of  height  is  the  equivalent  of  another  inch  of  height. 
Even  two  problems  alike  in  type  and  equally  difficult  for  a  large 
group  may  not  be  equally  difficult  for  a  single  pupil.  The  inch, 
on  the  other  hand,  has  the  same  significance  for  the  individual 
measurement  as  in  the  group. 

This  lack  of  equivalence  of  test  units  is  closely  related  to 
another  difference  between  mental  and  physical  scales.  The 
complete  measurement  of  a  mental  trait  is  probably  impossible, 
because  the  test  must  always  be  based  on  a  sampling  of  the  total 
available  material.  Spelling  ability,  for  example,  may  be  meas- 
ured by  a  number  of  well-known  scales,  but  no  single  test  nor 
the  combination  of  several  tests  will  give  a  complete  measure 
of  spelling  ability.  These  tests,  moreover,  will  be  only  roughly 
comparable  because  different  words  and  methods  of  testing  are 
employed.  An  approximate  transmutation  from  one  mental 
scale  to  another  is  always  possible,  but  nothing  approaching 
the exactness with which inches may be converted into
centimeters can probably ever be attained in the case of mental
measurements.

The examiner or observer in giving a mental test may introduce
certain errors by his failure to follow the uniform directions
for the administration of the test. He may create an unfavorable
mental attitude on the part of the pupils by hurrying them
or urging them to be overcautious. In scoring the results he
may make mistakes in using the key even with objective tests,
or show poor judgment in rating the specimens in the case of
product scales such as those for handwriting and composition.

Another  source  of  error  in  mental  measurement  is  associated 
with  what  Professor  Pearson  has  called  static  as  distinct  from 
dynamic  characters.  The  former  include  such  physical  traits  as 
height  and  weight,  the  measurement  of  which  is  direct  and  does 
not  depend  upon  the  attitude  of  the  person  at  the  time  of  ex- 
amination. Dynamic  characters  like  lung  capacity,  strength  of 
grip,  or  intelligence  must  be  measured  indirectly  by  some  form 
of  reaction,  and  therefore  depend  upon  the  bodily  or  mental 
fitness  of  the  individual.  The  measurement  of  dynamic  traits 
thus  gives  rise  to  a  variability  in  reaction  which  may  be  called 
response  error  of  the  person  tested. 

It  should  be  noted  that  when  a  pupil  has  been  examined 
several  times  on  equally  difficult  forms  of  a  test,  any  change  in 
his  response  may  be  due  in  part  to  the  attitude  of  the  examiner, 
to  imperfections  in  the  test  material,  to  practice  effect,  to  fluctu- 
ations in  emotional  status  and  fatigue,  etc.  Response  error  as 
measured  by  variation  in  score  may  thus  be  a  combination  of 
several  of  the  types  of  error  already  discussed.  Certain  formulas 
which  attempt  to  measure  response  variability  freed  from  other 
error  are  presented  in  Chapter  XIII,  section  9. 

As pointed out in the second section, the best approximation
to the true value of any quantity is given by the average of a
number of observations. For dynamic characters involving
response error this conception of true value may be misleading.
If a person has been tested ten times on as many equivalent
mental scales the average score may be the most typical one,
but the highest score is likely to be the best representation of his
true ability because on that performance there were fewer
interfering factors which prevented him from doing himself full
justice. The same argument might be made with regard to
characters  such  as  lung  capacity.  No  matter  how  often  the 
test  is  given  the  full  lung  capacity  will  never  be  registered,  and 
the  largest  volume  obtained  may  be  considered  as  nearest  the 
true  result. 

With  standardized  tests  both  the  average  (most  typical)  and 
the  highest  (nearest  the  true)  scores  will  be  useful,  the  former 
giving  the  best  prediction  as  to  future  performance,  and  the 
latter  the  best  indication  of  potential  ability  under  most  favor- 
able conditions. 

The  above  types  of  error  in  calculation  and  measurement 
may  be  briefly  summarized  as  follows : 

1.  Unbiased  or  rounding  errors  to  be  taken  into  account  in 
calculation. 

2.  Biased  errors  such  as  those  found  in  teachers'  marks. 

3.  Errors  of  the  scale : 

a.  Non-equivalent  units  or  items ; 

b.  Inadequate  sampling  of  available  material. 

4.  Errors  of  the  examiner : 

a.  In  giving  the  test ; 

b.  In  appraising  the  results  of  the  test. 

5.  Response  error  (or  variation)  of  the  examinee. 

EXERCISES 

1.  Round  off  the  following  numbers  to  four  significant  figures : 
35.675002,  846742.,  390000.,  .6744898,  .003674378. 

2.  If  the  numbers  39.2  and  18.3  are  correct  to  three  significant 
figures,  justify  the  product  717.  rather  than  700. 

Hint.   Use  maximum  and  minimum  products. 

3. Justify the quotient 18.3 ÷ 39.2 = 0.467.

4. Show that the sum of 13.26818, 138.36, 78.423, 7238.4289, and
6.324 cannot be as large as 7474.82 or as small as 7474.79.

5.  Show  that  the  product  of  34.68  and  4.6,  carried  to  three  digits, 
lies  between  158  and  161. 

6.  Find  the  probable  values  of  the  following : 

a. Sum of 27.843, 182.6, 5478.29, and 5.2777

b. Difference between 367.19 and 173.4395

c. Product of 897.5 and 0.08

d. Product of 37.846 and .0004

e. Quotient of 37.846 divided by .0004

f. Quotient of .0004 divided by 37.846

7. Calculate the following products, using Holzinger's Tables VI
and VII and a five-place logarithm table of numbers. Repeat the
calculations, rounding to four-place logarithms throughout, and
compare results.

Answers

a. $[1 - (.346)^2][1 - (.931)^2] = .1173$

b. $[1 - (.845)^2][1 - (.674)^2] = .1561$

c. $[1 - (.113)^2][1 - (.981)^2] = .0372$

d. $\sqrt{[1 - (.639)^2][1 - (.846)^2]} = .4101$

e. $\sqrt{[1 - (.550)^2][1 - (.947)^2]} = .2683$

f. $\sqrt{[1 - (.600)^2][1 - (.400)^2]} = .7332$

8. Discuss the theory of "most typical" and "nearest true"
scores given in section 7. Do you agree with the distinction and
use described by the author? If not, why not?

9.  Can  the  measurement  of  mental  abilities  ever  be  made  as 
exact  as  the  measurement  of  physical  objects?   Explain. 

10.  Estimate  the  absolute  and  relative  error  made  in  measuring 
a  person's  height  with  an  ordinary  yardstick.  Estimate  the  absolute 
and  relative  error  made  in  measuring  a  person's  intelligence,  by  a 
good  group  test  and  also  by  a  good  individual  test.  Use  any  data 
available  to  assist  in  these  estimates. 


CHAPTER  VI 

AVERAGES 

1.  Introductory 

It  has  already  been  shown  that  the  first  step  in  making  a  long 
series  of  observations  comprehensible  is  to  arrange  the  data  in 
the  form  of  a  frequency  distribution.  This  enables  one  to  see 
some  of  the  more  outstanding  characteristics  of  the  series  at  a 
glance,  and  at  the  same  time  makes  subsequent  calculations 
very  much  easier  than  they  would  have  been  with  the  data 
ungrouped. 

The hypothetical distributions shown in Fig. 21 reveal
certain important features by mere inspection. Curves (1) and
(2) center about the value 15, which is a measure of type or
average, but the first distribution is spread out more than the
second. This second characteristic is known as dispersion, or
variability. Distributions (3) and (4) are said to be skewed,
the former negatively and the latter positively. Curve (5) is
very steep (leptokurtic), whereas (6) is flat-topped (platykurtic).
The first distribution, which is midway between the two, might
be regarded as mesokurtic.

All  these  characteristics  are  very  important  in  statistical 
analysis  and  they  may  all  be  quantitatively  determined  by 
appropriate  formulas  rather  than  by  inspection  of  diagrams 
as  illustrated  in  Fig.  21.  In  the  present  chapter  methods  will 
be  presented  for  the  calculation  of  several  important  averages, 
which include the mean, median, and mode. Measures of
dispersion and skewness will be discussed in Chapter VII. The
kurtosis of a distribution is so rarely studied that no formulas
for its measurement are given in this text. Such formulas,
however, may be found in Kelley's Statistical Method.

2. Calculation of the Mean

The most important and generally most reliable average
happens also to be the best known. This is the arithmetical mean.
It is defined simply as the sum of the values of the observations
divided by their number, or by the formula

$$M = \frac{\Sigma X}{N}, \qquad \text{(Mean for ungrouped series)} \tag{5}$$

where M is used to represent the arithmetical mean, X a value
of the variable, and N the number of items. The symbol $\Sigma$
means "the sum of all quantities as follows," that is, the sum of
all the X's. One property of the mean which follows at once
from the above definition is that it is the magnitude each item
would have if all items were the same size.

Fig. 21. Illustrating variations in central tendency, dispersion,
skewness, and kurtosis

The calculation of the mean for ungrouped data is very
simple. It is only necessary to add the items and divide by their
number. For long series, however, this process becomes very
tedious and errors in addition are likely to creep in.
Calculation from the frequency distribution therefore becomes almost
imperative with many items. The method will first be
illustrated by the use of a short series which has been so selected
that the attention of the student will first be directed to the
method rather than to lengthy arithmetic. Needless to say, the
series is too short for the average to be of any practical value.

Let the mean of the following scores be required: 97, 72, 63,
68, 93, 84, 79, 87, 56, 52, 64, 71, 75, 67, 64. The total of these
items, $\Sigma X$, is 1092 and their mean is 72.8. This is the true mean
within the limits of the accuracy of the data.

Next,  assuming  the  scores  are  correct  to  the  nearest  unit 
only,  we  shall  arrange  them  in  a  frequency  distribution  as 
follows : 

Class          Frequency

89.5-99.5          2
79.5-89.5          2
69.5-79.5          4
59.5-69.5          5
49.5-59.5          2
                  --
                  15

For purposes of calculation it is assumed that the frequencies
are concentrated at the mid-points of the respective class
intervals, such points being known as class values (Chapter II,
section 8). The two top frequencies will thus contribute
2 × 94.5 = 189.0 to the total instead of 97 + 93 = 190, and so
on for the other classes, the complete calculation being

  X       f       fX

 94.5     2     189.0
 84.5     2     169.0
 74.5     4     298.0
 64.5     5     322.5
 54.5     2     109.0
         --    ------
         15    1087.5 = $\Sigma fX$

$$M = \frac{\Sigma fX}{N} = \frac{1087.5}{15} = 72.5$$
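The comparison between the true mean and the mean from class values may be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function names and the default arguments (lowest class limit 49.5, interval width 10) are assumptions chosen to match this short series.

```python
from collections import Counter

# The fifteen scores of the short illustrative series.
scores = [97, 72, 63, 68, 93, 84, 79, 87, 56, 52, 64, 71, 75, 67, 64]


def ungrouped_mean(values):
    """Formula (5): sum of the items divided by their number."""
    return sum(values) / len(values)


def class_value(score, lowest_limit=49.5, width=10):
    """Mid-point of the class interval that contains `score`."""
    index = int((score - lowest_limit) // width)
    return lowest_limit + index * width + width / 2


def grouped_mean(values, lowest_limit=49.5, width=10):
    """Mean computed as if every item sat at its class value."""
    freq = Counter(class_value(v, lowest_limit, width) for v in values)
    total = sum(x * f for x, f in freq.items())
    return total / sum(freq.values()), freq


true_mean = ungrouped_mean(scores)
approx_mean, freq = grouped_mean(scores)

print("Sum of items:", sum(scores))
print("True mean:", round(true_mean, 1))
print("Mean from class values:", approx_mean)
print("Frequencies by class value:", dict(sorted(freq.items(), reverse=True)))
```

The two sums differ by 4.5 and the two means by 0.3, which is the discrepancy discussed in the text.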

It is evident that the sums $\Sigma X$ and $\Sigma fX$ differ by 4.5, a
discrepancy which is due to the fact that the frequencies were
taken at class values instead of at observed values. With a
longer series and more class intervals the above discrepancy
would be smaller, because the larger number of unbiased errors
would tend to compensate, and with a narrower interval less
variation from the class values would be possible. The means,
it will be noted, differ by only 0.3 in spite of the short series
and coarse classification. It should also be observed that when
X represents the same series of values the quantities $\Sigma X$ and
$\Sigma fX$ are algebraically the same, f being merely a symbol of
operation showing that the X's were added in frequency groups.
The  above  calculation  may  be  considerably  shortened  by 
selecting  an  assumed  mean,  A,  near  the  middle  of  the  series  as 
origin  and  measuring  the  variable  in  units  of  class  intervals. 
We  shall  take  these  two  steps  separately  to  show  their  individual 
effect  upon  shortening  the  calculation. 


              f      X'     fX'             f      d     fd

              2      20      40             2      2      4
              2      10      20             2      1      2
A = 74.5      4       0       0             4      0      0
              5     -10     -50             5     -1     -5
              2     -20     -40             2     -2     -4
             --             ---            --            --
             15             -30 = $\Sigma fX'$      15            -3 = $\Sigma fd$

The X' series, or "reduced series," has been obtained by
subtracting 74.5 from each of the X's in the preceding illustration.
In order to obtain the mean from the calculation on the left it is
necessary to add 74.5 to the mean of the X' values since each has
been diminished by that amount, that is,
$M = 74.5 + \frac{-30}{15} = 72.5$.

It will be noted that the X' values are replaced in the work at
the right by d values, which are obtained by dividing the X' values
by the width of the class interval, h. In obtaining the mean of
the whole series, therefore, the mean of the d's, or $\frac{\Sigma fd}{N}$, must be
multiplied by h before being added to the assumed mean, A.
The work will then be $M = 74.5 + \frac{-3}{15} \times 10 = 72.5$.

Some students may understand the above method more
clearly by the following algebraic proof. From the definition of
X' we have

$$X' = X - A,$$

so that

$$X = A + X'.$$

Furthermore,

$$X' = dh.$$

Hence

$$X = A + dh.$$

Summing over this expression (or adding member by member as
many equations of this type as there are cases), we obtain

$$\Sigma X = \Sigma A + \Sigma dh.$$

Dividing by N, factoring out h (which is a constant throughout
the summation), and noting that $\Sigma A = NA$, we obtain the
required formula,

$$M = \frac{\Sigma X}{N} = A + \left(\frac{\Sigma fd}{N}\right) h. \qquad \text{(Mean for distribution)} \tag{6}$$

The symbol of operation, f, has been inserted for convenience.
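Formula (6) may be tried on the short series as a small Python sketch. The function name `mean_from_deviations` and its argument names are illustrative assumptions; the frequencies and deviations are those of the short series, with A = 74.5 and h = 10.

```python
def mean_from_deviations(assumed_mean, width, freqs, devs):
    """Formula (6): M = A + (sum of f*d / N) * h."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, devs))
    return assumed_mean + (sum_fd / n) * width


# Short series: class values 94.5, 84.5, 74.5, 64.5, 54.5.
class_values = [94.5, 84.5, 74.5, 64.5, 54.5]
freqs = [2, 2, 4, 5, 2]
devs = [2, 1, 0, -1, -2]  # deviations from A = 74.5 in class-interval units

short_method = mean_from_deviations(74.5, 10, freqs, devs)

# Direct calculation from class values, for comparison.
direct = sum(f * x for f, x in zip(freqs, class_values)) / sum(freqs)

print("Short method:", short_method)
print("Direct method:", direct)
```

Both routes give 72.5, since the short method only changes the origin and the unit of measurement.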

We shall next take a somewhat longer series in order to review
the above procedure and note a check on the work. The
following scores were made by a class in statistics on the Otis
Self-Administering Test:

Table 13. Illustrating the Calculation of the Mean with Check

                                          Check
Class            f      d      fd      d'     fd'

69.5-74.5        6      5      30       4      24
64.5-69.5        2      4       8       3       6
59.5-64.5        3      3       9       2       6
54.5-59.5        6      2      12       1       6
49.5-54.5       10      1      10       0       0     (A = 52 for the check)
44.5-49.5       23      0       0      -1     -23     (A = 47)
39.5-44.5        8     -1      -8      -2     -16
34.5-39.5        4     -2      -8      -3     -12
29.5-34.5        4     -3     -12      -4     -16
24.5-29.5        1     -4      -4      -5      -5

            N = 67        $\Sigma fd$ = 37        $\Sigma fd'$ = -30

$$M = 47 + \frac{37}{67} \times 5 = 47 + 2.76 = 49.76$$

$$M \text{ (Check)} = 52 - \frac{30}{67} \times 5 = 52 - 2.24 = 49.76$$


In the first of the above calculations for the mean the origin
is taken at 47, opposite the largest frequency, 23, because it
looks as if this would furnish a small $\Sigma fd$. The d's are then
tabulated 1, 2, 3, ... and -1, -2, -3, ... from this point and
the fd products formed. The remainder of the calculation
consists in substituting $\Sigma fd = 37$ in formula (6), where A = 47,
N = 67, and h = 5.

The check on the right is made by selecting a new reference
point or origin and repeating the calculation at least up to the
quantity $\Sigma fd'$. If the new origin differs from the old by one
class unit, $\Sigma fd'$ will differ from $\Sigma fd$ by N. This can be seen by
inspection or shown as follows:

$$\Sigma fd = \Sigma fd' \pm \Sigma f,$$

$$\therefore \ \Sigma fd = \Sigma fd' \pm N. \qquad \text{(Check on mean)} \tag{7}$$

In the above example, d' = d - 1 and $\Sigma fd'$ should equal
$\Sigma fd - N$, which it does, checking the work to that stage in the
calculation. The student is warned not to forget to multiply the
quantity $\Sigma fd / N$ by the width of the interval h. Failure to do so
is detected by carrying the check computation through to the
final result. It is therefore desirable to use the complete check
until the student is confident of the accuracy of his calculations.
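The check of formula (7) may be sketched in Python for the Otis scores of Table 13. The helper names are illustrative assumptions; the class values (mid-points 72, 67, ..., 27) and frequencies are taken from the table, and the two origins are 47 and 52.

```python
# Table 13, top class first: class values (mid-points) and frequencies.
class_values = [72, 67, 62, 57, 52, 47, 42, 37, 32, 27]
freqs = [6, 2, 3, 6, 10, 23, 8, 4, 4, 1]
H = 5  # width of the class interval
N = sum(freqs)


def sum_fd(origin):
    """Sum of f*d, with d measured from `origin` in class-interval units."""
    return sum(f * ((x - origin) // H) for f, x in zip(freqs, class_values))


def mean_from_origin(origin):
    """Formula (6) applied with the given assumed mean."""
    return origin + sum_fd(origin) / N * H


first = mean_from_origin(47)
check = mean_from_origin(52)

print("N =", N)
print("Sum fd  (A = 47):", sum_fd(47))
print("Sum fd' (A = 52):", sum_fd(52))
print("Difference of the two sums:", sum_fd(47) - sum_fd(52))
print("Mean:", round(first, 2), " Check:", round(check, 2))
```

The two sums differ by exactly N, as formula (7) requires, and both origins lead to the same mean.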

3.  Properties  of  the  Mean 

The  arithmetical  mean  has  several  important  properties 
which  should  be  noted.  First  of  all,  it  is  rigorously  defined  in 
algebraic  terms  and  is  based  directly  on  the  actual  values  of  all 
the  items.  This  makes  it  possible  to  obtain  a  definite  average 
for  any  quantitative  series,  and  gives  a  result  which  is  truly 
characteristic  of  the  whole  distribution. 

The algebraic character of the mean makes possible the
combination of averages from several series. Thus, if $X_1$, $X_2$, and
$X_3$ denote the variables in three different groups of size $N_1$, $N_2$,
and $N_3$, the three means will be $M_1 = \frac{\Sigma X_1}{N_1}$, $M_2 = \frac{\Sigma X_2}{N_2}$, and
$M_3 = \frac{\Sigma X_3}{N_3}$. The mean of all three series is the sum of all the X's
divided by the total number of items, or

$$M = \frac{\Sigma X_1 + \Sigma X_2 + \Sigma X_3}{N_1 + N_2 + N_3}.$$

This result may be obtained from the individual means by
multiplying each mean by the size of its group and dividing
the sum of these products by the total number of items,
that is,

$$M = \frac{N_1 M_1 + N_2 M_2 + N_3 M_3}{N_1 + N_2 + N_3}.$$

This property might be
of great advantage in combining norms from different localities,
for it would be necessary to know only the means and the
number of cases in each group.

It should be noted that the mean of several means is rarely
the average of the items on which the separate means are based.
This can be seen by a very simple example. Let $M_1 = \frac{\Sigma X_1}{N_1} = 3$,
$M_2 = \frac{\Sigma X_2}{N_2} = 4$, and $M_3 = \frac{\Sigma X_3}{N_3} = 5$. The mean of all the items
is $\frac{\Sigma X_1 + \Sigma X_2 + \Sigma X_3}{N_1 + N_2 + N_3}$, or 4.1, while the mean of the three means is
$\frac{3 + 4 + 5}{3}$, or 4.0. Both results are entirely correct, but
represent quite different things. The reason for the discrepancy
may be seen by noting that if the three means are different,

$$\frac{N_1 M_1 + N_2 M_2 + N_3 M_3}{N_1 + N_2 + N_3} \neq \frac{M_1 + M_2 + M_3}{3}$$

unless the three N's are equal. Since it is usually the mean of the items
which is required, the averaging of averages should ordinarily
be avoided.
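The distinction between the mean of all the items and the mean of the means may be sketched as follows. The group sizes 2, 5, and 3 are hypothetical; they are assumed here only because, with means of 3, 4, and 5, they reproduce the combined mean of 4.1 quoted in the text.

```python
# Each group is (mean, number of items). The sizes are hypothetical.
groups = [(3.0, 2), (4.0, 5), (5.0, 3)]


def combined_mean(groups):
    """Weight each mean by the size of its group, then divide by total N."""
    total = sum(mean * size for mean, size in groups)
    count = sum(size for _, size in groups)
    return total / count


def mean_of_means(groups):
    """Simple average of the means, ignoring group sizes."""
    return sum(mean for mean, _ in groups) / len(groups)


weighted = combined_mean(groups)
unweighted = mean_of_means(groups)

# With equal group sizes the two results coincide.
equal_groups = [(3.0, 4), (4.0, 4), (5.0, 4)]

print("Mean of all items:", round(weighted, 1))
print("Mean of the three means:", unweighted)
print("Equal sizes:", combined_mean(equal_groups), mean_of_means(equal_groups))
```

Only the means and the sizes of the groups are needed for the combined mean; the individual items never enter the calculation.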

Another  property  of  the  mean  appearing  from  the  definition 
is  that  every  item,  large  or  small,  contributes  its  proportionate 
share  to  the  result.  This  is  regarded  by  some  as  a  defect  since 
extreme  items  appear  to  have  an  undue  effect  upon  the  mean. 
Against  this  objection  it  might  be  argued  that  if  such  extreme 
observations  belong  in  the  series,  they  should  be  permitted  to 
contribute  their  full  share. 

The algebraic properties of the mean are of further
importance in mathematical statistics where this average enters as
a parameter in many formulas. It is probably as valuable in
this respect as any other statistical constant.

A final, and in some respects the most valuable, property of
the mean is its stability under fluctuations of sampling for
ordinary distributions. If samples are drawn from a large body
of material and a number of means calculated, they will usually
be closer to the mean of the whole material than if any other
average had been employed. This property is often
characterized as the reliability of the mean (see Chapter XIII).

4. Calculation of the Median

The median for ungrouped series has already been introduced
in connection with the classifier described in Chapter II. It is
the middlemost value of the variable when the observations
are ranked in order of size, or the magnitude such that greater
and smaller values occur with equal frequency. For an odd
number of observations without ties in rank, it is clearly the
magnitude of the middle observation. For an even number of
cases any value between the two middle items will satisfy the
above definition, but it is customary to take as the median the
average of the two middle values. In case there are ties in rank
near the middle of the series, a weighted average is sometimes
used as illustrated by the following observations: 1, 3, 5, 9,
10, 12, 12, 12, 14, 14, 15, 16, 17, 21, 23, 25. The value halfway
between 12 and 14 might serve as the median, but the weighted
mean of the middle observations would seem to give a little
more stable result. The median in this example is thus

$$\frac{3 \times 12 + 2 \times 14}{5} = 12.8.$$
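The weighted treatment of ties may be sketched in Python. The rule coded here is one reading of the example above, stated as an assumption: for an even number of cases, each of the two middle values is weighted by the number of times it occurs in the series. The function name is illustrative.

```python
import statistics


def tie_weighted_median(values):
    """Median for ungrouped data, weighting tied middle values.

    Assumed rule: with an even number of cases, the two middle values are
    averaged with weights equal to their frequencies in the series.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return float(ordered[n // 2])
    low, high = ordered[n // 2 - 1], ordered[n // 2]
    if low == high:
        return float(low)
    w_low, w_high = ordered.count(low), ordered.count(high)
    return (w_low * low + w_high * high) / (w_low + w_high)


observations = [1, 3, 5, 9, 10, 12, 12, 12, 14, 14, 15, 16, 17, 21, 23, 25]

weighted = tie_weighted_median(observations)
customary = statistics.median(observations)  # halfway between 12 and 14

print("Weighted median:", weighted)
print("Customary median:", customary)
```

The customary rule gives 13, while the weighting by ties draws the result toward the more frequent value, 12.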

When there are a sufficient number of observations to warrant
the use of a frequency distribution, the above difficulties do not
arise. The histogram of the Otis scores from Table 13 will
illustrate the procedure in this case. Under this representation the
frequencies are assumed to be spread evenly over the class
intervals, the areas being exactly proportional to the number of
items between any two class limits. The median is now to be
regarded as the value of the variable on either side of which half the
frequencies lie. The graphical solution amounts to determining
the point on the scale the vertical through which bisects the
area under the histogram. It is thus only necessary to count in
the frequencies from either end and interpolate across the
interval containing the median.

From Fig. 22 it will be noted that half of the frequencies is
33.5, so that the problem is to determine the point above and
below which 33.5 frequencies lie. Counting up from the lower
end of the scale it is apparent that 17 frequencies lie below 44.5
and 40 frequencies lie below 49.5. The median therefore lies
somewhere between these two values. The difference
33.5 - 17 = 16.5 gives the number of frequencies beyond 44.5
necessary to reach the median. From the rectangles in the
diagram it is apparent that $\frac{x}{5} = \frac{16.5}{23}$, so that the required distance,
x, is $\frac{16.5}{23} \times 5 = 3.6$. The median is therefore 44.5 + 3.6 = 48.1.
The work may be checked by counting down from the upper end
of the scale, giving

$$\text{median} = 49.5 - \frac{(33.5 - 27)}{23} \times 5 = 48.1.$$

Fig. 22. Illustrating the median for the scores
on the Otis Self-Administering Test


Using certain abbreviations, we may write two formulas for
calculating the median in the case of the frequency distribution.
The term median interval is used to designate the class interval
which contains the median. Let

u.l. and l.l. = upper and lower limits of median interval,
for example, 49.5 and 44.5 in Fig. 22,

$f_{up}$ and $f_{do}$ = total frequency up to and down to median
interval, for example, 17 and 27,

$f_{md}$ = frequency of median interval, for example, 23,

h = width of class interval, and

Md = median.

The formulas then become

$$Md = l.l. + \left(\frac{\frac{N}{2} - f_{up}}{f_{md}}\right) h \qquad \text{(Median for distribution, counting up)} \tag{8 a}$$

and

$$Md = u.l. - \left(\frac{\frac{N}{2} - f_{do}}{f_{md}}\right) h. \qquad \text{(Median for distribution, counting down)} \tag{8 b}$$

If the student finds it easier to do the calculation by a series
of steps, the following may be useful:

1. Divide the number of cases by 2, ($\frac{N}{2} = 33.5$).

2. Determine by inspection the interval containing the
median, (44.5 - 49.5).

3. Count the frequencies up to the median interval, ($f_{up} = 17$).

4. Subtract this last result from $\frac{N}{2}$, (33.5 - 17 = 16.5).

5. Multiply the last result by the width of the interval and
divide by the number of frequencies in the median interval,
($\frac{16.5 \times 5}{23} = 3.6$).

6. Add this quantity to the lower limit of the median interval,
thus obtaining the required median, (44.5 + 3.6 = 48.1).

A similar series of steps may be written out for the
calculation when counting down from the upper end of the scale.
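The steps above, together with the check from the upper end, may be sketched in Python for the Otis scores of Table 13. The function name `grouped_median` is an illustrative assumption; it returns both the result of counting up and the result of counting down so that the two may be compared.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of a frequency distribution with classes in ascending order.

    Returns (median counting up, median counting down). A median interval
    with zero frequency is skipped, since the median is then indeterminate.
    """
    n = sum(freqs)
    half = n / 2
    cumulative = 0  # frequencies lying below the current interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= half:
            f_up = cumulative
            f_do = n - cumulative - f
            counting_up = lower + (half - f_up) / f * width
            counting_down = (lower + width) - (half - f_do) / f * width
            return counting_up, counting_down
        cumulative += f
    raise ValueError("empty distribution")


# Table 13 in ascending order of class.
lower_limits = [24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5, 59.5, 64.5, 69.5]
freqs = [1, 4, 4, 8, 23, 10, 6, 3, 2, 6]

up, down = grouped_median(lower_limits, freqs, width=5)
print("Counting up:", round(up, 1))
print("Counting down:", round(down, 1))
```

Both directions give 48.1, the agreement serving as the check on the work.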

It has been noted that the median determines the point on
the horizontal scale the vertical through which bisects the area
of the histogram. The mean, on the other hand, is the point at
which the histogram would balance. It is the center of gravity
of the distribution. The fd's correspond to the moments in
physics (force × distance), and the mean or center of gravity
occurs where $\Sigma fd = 0$.
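The balancing property may be verified numerically. The following Python sketch uses the class values and frequencies of Table 13; the function name `total_moment` is an illustrative assumption, and deviations are taken in score units rather than class-interval units.

```python
# Table 13 in ascending order: class values (mid-points) and frequencies.
class_values = [27, 32, 37, 42, 47, 52, 57, 62, 67, 72]
freqs = [1, 4, 4, 8, 23, 10, 6, 3, 2, 6]


def total_moment(point):
    """Sum of frequency times distance from `point` (signed)."""
    return sum(f * (x - point) for f, x in zip(freqs, class_values))


mean = sum(f * x for f, x in zip(freqs, class_values)) / sum(freqs)

print("Mean:", round(mean, 2))
print("Total moment about the mean:", round(total_moment(mean), 9))
print("Total moment about the median 48.1:", round(total_moment(48.1), 1))
```

The moments cancel about the mean but not about the median, which bisects the area without balancing the histogram.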

Table 14 shows the complete calculation of the
mean and median for a longer series. It will be noted that
the frequencies are given at central ages 45, 44, etc., or classes
44.5-45.5, 43.5-44.5, etc., since all ages were tabulated to the
nearest year. The data show the ages at which a group of
college professors listed in Who's Who received their Ph.D.
degrees. All these men had an A.B. but none an A.M. degree.


Table 14. Illustrating the Calculation of the Mean and Median

Age Received                               Check
   Ph.D.          f      d      fd      d'     fd'

    45            3     16      48      17      51
    44            0     15       0      16       0
    43            3     14      42      15      45
    42            3     13      39      14      42
    41            1     12      12      13      13
    40            5     11      55      12      60
    39            9     10      90      11      99
    38            5      9      45      10      50
    37            5      8      40       9      45
    36            7      7      49       8      56
    35            7      6      42       7      49
    34           10      5      50       6      60
    33           13      4      52       5      65
    32           17      3      51       4      68
    31           29      2      58       3      87
    30           42      1      42       2      84
    29           31      0       0       1      31
    28           27     -1     -27       0       0     (median interval)
    27           37     -2     -74      -1     -37
    26           54     -3    -162      -2    -108
    25           38     -4    -152      -3    -114
    24           29     -5    -145      -4    -116
    23           14     -6     -84      -5     -70
    22            7     -7     -49      -6     -42
    21            2     -8     -16      -7     -14
    20            2     -9     -18      -8     -16

            N = 400       $\Sigma fd$ = -12       $\Sigma fd'$ = 388

190 cases lie above the median interval and 183 cases below it.

$$M = 29 + \frac{-12}{400} \times 1 = 28.97$$

$$M = 28 + \frac{388}{400} \times 1 = 28.97$$

$$Md = 27.5 + \frac{17}{27} \times 1 = 28.13$$

$$Md = 28.5 - \frac{10}{27} \times 1 = 28.13$$
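The results of Table 14 may be recomputed directly from the ages and frequencies. In the Python sketch below the frequencies are as read from the table, on the assumption that the top class is age 45 and that age 44 has no cases; the variable names are illustrative.

```python
# Ages 20 to 45 in ascending order, with frequencies as read from Table 14.
ages = list(range(20, 46))
freqs = [2, 2, 7, 14, 29, 38, 54, 37, 27, 31, 42, 29, 17,
         13, 10, 7, 7, 5, 5, 9, 5, 1, 3, 3, 0, 3]

n = sum(freqs)

# Mean taken directly from the class values (ages).
mean = sum(a * f for a, f in zip(ages, freqs)) / n

# Median by counting up to the median interval, as in formula (8 a).
half = n / 2
below = 0
median = None
for age, f in zip(ages, freqs):
    if f > 0 and below + f >= half:
        # Each class runs from age - 0.5 to age + 0.5, so h = 1.
        median = (age - 0.5) + (half - below) / f * 1
        break
    below += f

print("N =", n)
print("Cases below the median interval:", below)
print("Mean:", round(mean, 2))
print("Median:", round(median, 2))
```

The direct calculation agrees with the short method of the table, giving a mean of 28.97 and a median of 28.13.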

5.  Properties  of  the  Median 

The lack of rigor in the definition of the median for
undistributed series has already been noted, and in this respect the
mean is clearly superior. For large bodies of data, however, in
which the use of the frequency distribution becomes imperative,
no difficulties as to rigid definition are likely to arise.*

* Note that the median for grouped data becomes indeterminate when the
frequency of the median interval is zero. This form of distribution, however, is very rare.

The median is based only indirectly upon all of the
observations inasmuch as it is determined by their relative size.
Whether or not this is an advantage over the mean depends
upon the particular purpose for which the average is used.
Under ordinary circumstances all items should contribute fully
if included at all and the mean is therefore generally superior.

In combining the averages of several series the mean has a
great advantage over the median. A simple combination of the
separate means and totals as shown in section 3 will furnish the
mean of the entire group of items, while in order to determine
the grand median it is necessary to combine all of the separate
distributions into one and calculate from this. As regards other
algebraic properties the median is again inferior since it cannot
be employed in connection with the formulas of higher
statistical analysis.

The reliability of the median, or its stability under
fluctuations of sampling, is in general less than that of the mean. Only
for very peaked or leptokurtic distributions of the type
illustrated in (5) of Fig. 21 is the median superior in this respect.*

The advantage thus far appears to be entirely in favor of the
mean, but the median has at least two points of superiority. It
is easier to calculate for both long and short series, and in the
case of ungrouped data the middle item which furnishes the
median can be uniquely identified and will remain the median
item under any other form of measurement. Thus the height
of the eleventh man in a group of twenty-one is typical of all in
a very real sense, while the mean of the series will very probably
not correspond to the height of any particular individual.

For the large bulk of test data the norms, or average scores
for unselected groups, are given in the form of the medians. In
using such tests and in making comparisons it is therefore
necessary to use this form of average. For most problems, however,
the mean is distinctly superior and should be used unless there
is some very good reason to the contrary.

* G. U. Yule, Introduction to Statistics, p. 539. C. Griffin & Co., London.

6. The Crude Mode

The modal value of a variable is the value of the most
frequent occurrence. Thus in Fig. 21 the modes are the abscissas
corresponding to the highest points of the curves. For grouped
series it is possible to obtain only a crude mode, which may be
defined as the class value of the group with the largest frequency.

The crude mode is obviously unstable inasmuch as it will
depend upon the fineness of classification used in grouping the
data. By widening or narrowing the class interval, the mode
may be made to shift very considerably up or down the scale.
It is therefore to be used only for rough inspectional purposes.
Its great advantage, of course, lies in the fact that it can be
determined at a glance.
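The instability of the crude mode under regrouping may be sketched in Python. The frequencies below are hypothetical, invented only to show how two peaks under one-year intervals can merge into a single peak under two-year intervals; the function names are likewise illustrative, and a "mode" is taken here as any class whose frequency exceeds that of both its neighbors.

```python
def crude_modes(class_values, freqs):
    """Class values whose frequency exceeds that of both neighbors."""
    modes = []
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else -1
        right = freqs[i + 1] if i < len(freqs) - 1 else -1
        if f > left and f > right:
            modes.append(class_values[i])
    return modes


def regroup_pairs(class_values, freqs):
    """Combine adjacent classes two at a time, doubling the interval."""
    pairs = range(0, len(freqs) - 1, 2)
    new_values = [(class_values[i] + class_values[i + 1]) / 2 for i in pairs]
    new_freqs = [freqs[i] + freqs[i + 1] for i in pairs]
    return new_values, new_freqs


# Hypothetical spans in years (class values) and frequencies.
spans = [0, 1, 2, 3, 4, 5]
freqs = [5, 30, 12, 28, 6, 3]

wide_values, wide_freqs = regroup_pairs(spans, freqs)

print("One-year intervals, crude modes:", crude_modes(spans, freqs))
print("Two-year intervals, crude modes:", crude_modes(wide_values, wide_freqs))
```

With these assumed figures the modes at one and three years give way to a single mode at two and one-half years when the interval is doubled.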

The  following  distribution  shows  two  crude  modes  for  the 
A.B.  to  A.M.  spans  of  a  group  of  college  professors.  The  spans 
or  years  elapsing  between  degrees  are  again  given  at  class  values. 


In the first frequency distribution given by f, the crude modes
appear at one and at three years. Grouping by two-year
intervals brings a single mode at two and one-half years. The two
crude modes are of greater practical interest in this example
because they show that if a graduate student fails to get his
master's degree in one year, he will very likely take three years

That this result should follow appears at once from the more
general form of the geometrical series $a, ar, ar^2, ar^3, ar^4$, etc.,
the G.M. of the first five terms being

$$\sqrt[5]{a \cdot ar \cdot ar^2 \cdot ar^3 \cdot ar^4} = \sqrt[5]{a^5 r^{10}} = ar^2,$$

or the middle term, which is 18 in the example on page 91.

The arithmetic mean of the above items is 21.1. In order to
show the relationship between these two means, they have been
plotted with the data in Fig. 23. As a general average of the
five numbers in the series the arithmetic mean is quite adequate,
but as a measure of the average item in such a trend the
geometric mean only is correct. This is further illustrated by the
means of the first four terms. The A.M. is now 16.25 and the
G.M. 14.70, the latter being again a point on the smooth
curve connecting the items in the series.

The geometric mean is thus useful in determining averages in
historical trends where the items form something like a
geometrical series. The cost data in Table 15 furnish an example
of this sort.
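The relation between the two means for a geometrical series may be sketched in Python. The series 8, 12, 18, 27, 40.5 (a = 8, r = 1.5) is an assumption: it is the geometric progression consistent with the middle term 18 and with the means 21.1, 16.25, and 14.70 quoted in the text. The function names are illustrative.

```python
import math


def geometric_mean(values):
    """The N-th root of the product of N positive values."""
    return math.prod(values) ** (1 / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Assumed series: a = 8, r = 1.5, five terms.
a, r = 8.0, 1.5
series = [a * r ** k for k in range(5)]

print("Series:", series)
print("Five terms: A.M. =", round(arithmetic_mean(series), 1),
      " G.M. =", round(geometric_mean(series), 2))
print("Four terms: A.M. =", round(arithmetic_mean(series[:4]), 2),
      " G.M. =", round(geometric_mean(series[:4]), 2))
```

For five terms the geometric mean falls on the middle term; for four terms it falls between the second and third terms, at $ar^{1.5}$, which lies on the curve through the items.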

Fig. 23. Illustrating the geometric and
arithmetic means for four and five items

From the nature of a geometric series it is apparent that the
ratio of each term to the one just preceding is equal to the
constant ratio r. Applying this test to the cost data, a series of
fairly equal ratios is found, showing that the original items
form approximately a geometrical progression. In determining
the expenditure at any point in this trend, the geometric
mean should therefore be used. The total yearly costs have
been taken at the mid-year points, so that if the cost from the
middle of 1909 to the middle of 1910 were required it could be
approximated by finding the geometric mean of 401.4 and 426.3,
or √171,116.82, which is 413.7, or, since we are figuring in
millions of dollars, $413,700,000.

Table 15. Cost Data Illustrating the Use of the Geometric Mean

               Expenditure for
               Public Schools in      Ratio of
               the United States      Each Item        Theoretical
Year           (in Millions)          to One above     Series

1901-1902          227.5                                  230.0
1902-1903          238.3                1.047             246.9
1903-1904          252.8                1.061             265.1
1904-1905          273.2                1.081             284.8
1905-1906          291.6                1.067             305.8
1906-1907          307.8                1.056             328.5
1907-1908          336.9                1.095             352.8
1908-1909          371.3                1.102             378.9
1909-1910          401.4                1.081             406.9
1910-1911          426.3                1.062             437.0
1911-1912          446.7                1.048             469.3
1912-1913          482.9                1.081             504.1
1913-1914          521.5                1.080             541.4
1914-1915          555.1                1.064             581.4
1915-1916          605.5                1.091             624.5
1916-1917          640.7                1.058             670.7
1917-1918          702.2                1.096             720.3
1918-1919          763.7                1.088             773.6

               A.M. = 435.9         A.M. = 1.0740
               G.M. = 406.9         G.M. = 1.0739

For the entire series the A.M. is 435.9 and the G.M. 406.9,
while for the set of accompanying ratios the arithmetic and
geometric means agree at 1.074. The correct method of
averaging such ratios is by the geometric mean, but in the above
example there is very close agreement between the two averages
because of the even nature of the series.
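The averages quoted for the cost data may be recomputed with a short Python sketch. The expenditures are those of Table 15; the function names are illustrative assumptions, and the ratios are computed from the unrounded expenditures, so that they may differ from the printed three-place ratios in the last figure.

```python
import math

# Expenditures for public schools, 1901-1902 to 1918-1919, in millions.
costs = [227.5, 238.3, 252.8, 273.2, 291.6, 307.8, 336.9, 371.3, 401.4,
         426.3, 446.7, 482.9, 521.5, 555.1, 605.5, 640.7, 702.2, 763.7]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """Computed through logarithms, as one would with a log table."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Ratio of each item to the one just preceding.
ratios = [later / earlier for earlier, later in zip(costs, costs[1:])]

# Geometric mean of two adjacent items, used to interpolate in the trend.
interpolated = math.sqrt(401.4 * 426.3)

print("A.M. of costs:", round(arithmetic_mean(costs), 1))
print("G.M. of costs:", round(geometric_mean(costs), 1))
print("A.M. of ratios:", round(arithmetic_mean(ratios), 4))
print("G.M. of ratios:", round(geometric_mean(ratios), 4))
print("G.M. of 401.4 and 426.3:", round(interpolated, 1))
```

Since the product of the successive ratios is simply the last item divided by the first, the geometric mean of the ratios depends only on the two end values and the number of steps between them.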

The theoretical series given in Table 15 was obtained by
forming a geometric progression with a = 406.9 and r = 1.074, and
extending it in both directions from the beginning of the year
1910. It may be noted that an error in the fourth place of
these numbers may occur because of the cumulative nature of
the errors in the powers of r.

From the very good agreement between the observed and
theoretical cost trends it is further apparent that the
expenditure data form a good approximation to a geometrical series.
From the years 1901 to 1918 the cost rose an average of
7.4 per cent each year over the one just preceding. The
character of this increase is the same as that of a sum of money out
at compound interest.

The close agreement noted above might tempt one to extend
the curve and predict future costs. Thus the expenditure in
1922 might have been forecast as \((406.9)(1.074)^{12.5}\). This
gives 993.2 as compared with an actual expenditure of
1,580.7 millions. The large discrepancy is due chiefly to the
influence of post-war conditions upon the purchasing power of
money. In making the prediction it was assumed that the same
factors influencing expenditures from 1901 to 1918 would continue
to operate in 1922. This assumption, as we have seen, is
not valid.
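
The extrapolation can be sketched as follows. This is an illustration only: it assumes the progression starts from a = 406.9 at the beginning of 1910 with r = 1.074, and that the forecast refers to the middle of 1922, 12.5 years later (an assumption, though it agrees with the 993.2 quoted above). The function name `theoretical_cost` is hypothetical.

```python
# Geometric progression fitted in the text: a = 406.9 at the beginning
# of 1910, common ratio r = 1.074 per year.
a, r = 406.9, 1.074

def theoretical_cost(years_from_1910):
    """Value of the fitted progression a * r**t, t in years from the start of 1910."""
    return a * r ** years_from_1910

# Assumed: the middle of 1922 lies 12.5 years after the beginning of 1910.
forecast_1922 = theoretical_cost(12.5)
actual_1922 = 1580.7                      # millions, as quoted in the text
shortfall = actual_1922 - forecast_1922

print(round(forecast_1922, 1), round(shortfall, 1))
```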


The harmonic mean of a series of observations is the reciprocal
of the arithmetic mean of their reciprocals, or if H (or H.M.) be
the harmonic mean,

\[ \frac{1}{H} = \frac{1}{N}\,\Sigma\frac{1}{X} \]   {Harmonic mean}   (11)

For the series [values illegible in the source] the harmonic mean
will then be

[computation illegible in the source]

The work can be done very readily using a table of reciprocals.

The three means for the above data may now be written

H.M. = 15.4,

[G.M. illegible in the source]

A.M. = 21.1,

which is the [illegible in the source] by [illegible] in algebra.

The harmonic mean may be illustrated by a supposititious
problem. Let us assume that five pupils work on some problems,
with the following rates and times:

      Rate (r)        Time in minutes (t)

         10                  6
          8                  7.5
          6                 10
          4                 15
          2                 30

      Mr =  6             Mt = 13.7
      Hr =  4.38          Ht = 10


If r and Mr denote the rate and arithmetic mean rate at which
the problems were worked, and t and Ht represent the time and
harmonic mean time in minutes required to work a problem, it
is evident that

\[ r = 60 \times \frac{1}{t} \]

and

\[ M_r = 60 \times \frac{1}{N}\,\Sigma\frac{1}{t}, \]

or

\[ M_r = 60 \times \frac{1}{H_t}. \]

Similarly,

\[ M_t = 60 \times \frac{1}{H_r}, \]

\[ 13.7 = 60 \times \frac{1}{4.38}. \]


The  arithmetic  mean  rates  and  mean  times  may  therefore  be 
obtained  by  determining  the  reciprocals  of  the  harmonic  mean 
times  and  rates  and  multiplying  by  the  proper  constant. 

Two  experimenters,  for  instance,  might  have  recorded  their 
results,  one  as  rate  and  one  in  time,  but  by  using  the  above 
relationships  their  averages  could  be  made  directly  comparable. 
Thus  in  the  above  example  the  mean  rate,  6,  may  be  found 
by  the  arithmetic  mean  of  the  rates  or  by  dividing  60  by  the 
harmonic  mean  of  the  corresponding  times. 

In  general,  if  the  test  results  are  recorded  as  rates,  Mr  should 
be  employed ;  but  if  times  are  recorded,  Mt  should  be  used. 
The  corresponding  harmonic  means  are  chiefly  useful  in  making 
results  comparable  when  necessary. 
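
The relationships between the mean rate, the mean time, and the two harmonic means can be verified numerically. The sketch below assumes five rates of 10, 8, 6, 4, and 2 problems per hour (values consistent with the means quoted above; the unit is inferred from the constant 60), and the function names are illustrative.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    # Reciprocal of the arithmetic mean of the reciprocals, formula (11).
    return len(values) / sum(1.0 / v for v in values)

rates = [10, 8, 6, 4, 2]              # assumed: problems per hour
times = [60.0 / r for r in rates]     # minutes per problem

M_r, H_r = arithmetic_mean(rates), harmonic_mean(rates)
M_t, H_t = arithmetic_mean(times), harmonic_mean(times)

# Mean rate from the harmonic mean time, and mean time from the
# harmonic mean rate, each multiplied by the constant 60.
rate_from_times = 60 / H_t
time_from_rates = 60 / H_r

print(M_r, round(H_r, 2), round(M_t, 1), round(H_t, 1))
print(round(rate_from_times, 2), round(time_from_rates, 1))
```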


EXERCISES 


1. Calculate the mean and median for each of the following
frequency distributions:

        (1)                   (2)                   (3)
   Class       f         Class       f        Class Value    f

 94.5-99.5     1       36.5-38.5     1            10         1
 89.5-94.5     2       34.5-36.5     -             9         2
 84.5-89.5     3       32.5-34.5     3             8         5
 79.5-84.5     5       30.5-32.5     4             7        10
 74.5-79.5     7       28.5-30.5    10             6        12
 69.5-74.5     6       26.5-28.5     4             5        10
 64.5-69.5     4       24.5-26.5     3             4         4
 59.5-64.5     -       22.5-24.5     2             3         3
 54.5-59.5     1       20.5-22.5     2             2         -
                                                   1         2

 M  = 77.52            M  = 28.81             M  = 5.86
 Md = 77.00            Md = 29.20             Md = 5.96

        (4)                   (5)
   Class       f        Class Value   f

 89.5-99.5     1            95        1
 79.5-89.5     2            85        -
 69.5-79.5     5            75        3
 59.5-69.5    20            65        -
 49.5-59.5    16            55        5
 39.5-49.5     4            45        6
 29.5-39.5     5            35        7
 19.5-29.5     2            25        -
  9.5-19.5     1            15        4
                             5        2

 M  = 57.36             M  = 42.14
 Md = 59.50             Md = 41.67

        (6)                   (7)
   Class       f         Class         f

 40.5-43.5     1       34.95-39.95     1
 37.5-40.5     2       29.95-34.95     3
 34.5-37.5     5       24.95-29.95     7
 31.5-34.5     6       19.95-24.95    10
 28.5-31.5     7       14.95-19.95     4
 25.5-28.5    10        9.95-14.95     2
 22.5-25.5     4        4.95-9.95      2
 19.5-22.5     3
 16.5-19.5     2

 M  = 29.325            M  = 22.79
 Md = 28.93             Md = 23.20

        (8)                   (9)
   Class        f       Class Value   f

 10.25-11.25    2          11.5       1
  9.25-10.25    2          10.5       2
  8.25-9.25     4           9.5       4
  7.25-8.25     7           8.5       5
  6.25-7.25     8           7.5       6
  5.25-6.25     4           6.5       7
  4.25-5.25     2           5.5       4
  3.25-4.25     1           4.5       2

 M  = 7.35              M  = 7.56
 Md = 7.25              Md = 7.42
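
The printed answers to Exercise 1 can be checked mechanically. The following is a rough sketch, not the text's own procedure: it takes each class at its mid-point for the mean and interpolates within the median interval, counting in from the low end. The function names and the layout of the inputs are assumptions made for illustration, and distribution (1) is entered with a frequency of zero for the class 59.5-64.5.

```python
def grouped_mean(lower_limits, width, freqs):
    """Mean of a frequency distribution, each class taken at its mid-point."""
    n = sum(freqs)
    return sum(f * (lo + width / 2) for lo, f in zip(lower_limits, freqs)) / n

def grouped_median(lower_limits, width, freqs):
    """Median by interpolation, counting in N/2 cases from the low end."""
    half = sum(freqs) / 2
    below = 0
    for lo, f in sorted(zip(lower_limits, freqs)):
        if f > 0 and below + f >= half:
            return lo + (half - below) / f * width
        below += f
    raise ValueError("empty distribution")

# Distribution (1): classes 54.5-59.5 up to 94.5-99.5, width 5.
lowers = [54.5, 59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5]
freqs = [1, 0, 4, 6, 7, 5, 3, 2, 1]

mean_1 = grouped_mean(lowers, 5, freqs)
median_1 = grouped_median(lowers, 5, freqs)
print(round(mean_1, 2), round(median_1, 2))
```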


2. Calculate the means and medians for the data of Exercise 1,
Chapter II, using class intervals of 10 for the Otis and Terman tests,
and an interval of 5 units for the Chicago test.

Ans.           Otis      Chicago     Terman

Mean          139.3       53.75      124.5
Median        140.6       53.64      124.5

3.  Verify  the  means  and  medians  for  the  distributions  of  the  Army- 
Alpha  Test  given  on  pages  98  and  99.  The  class  values  were  taken 
at  207.5,  202.5,  etc.,  which  makes  the  averages  .5  larger  than  they 
would  have  been  if  the  intervals  had  been  given  as  204.5-209.5,  etc. 
Variables: Alpha Score x Schooling. Group I, II, III: White Draft
(Native Born)*

For men who took alpha only.

                  Grades                    High School
Alpha Score       7        8        1        2        3        4

205-209           -        -        1        -        -        -
200-204           -        -        -        -        -        -
195-199           -        -        1        -        1        -
190-194           -        -        -        -        2        3
185-189           -        1        2        3        2        9
180-184           -        2        2        6        4       15
175-179           2        5        4        6        4       15
170-174           2        5        6       10       11       22
165-169           3        8        8        7       10       38
160-164           -       12       18       12       10       48
155-159           2       15       22       20       24       56
150-154           7       29       36       30       29       63
145-149           7       48       27       42       34       96
140-144           7       62       44       41       41       98
135-139          15       76       55       57       46      106
130-134          19      108       73       85       69      130
125-129          17      159       86       89       62      120
120-124          24      164       92       94       74      121
115-119          36      249      113      129       80      148
110-114          52      309      136      126       91      151
105-109          66      384      173      146       83      140
100-104          97      430      168      174       97      135
95-99           141      523      199      148       95      135
90-94           170      624      209      174       97      105
85-89           187      661      230      201      107      110
80-84           247      756      232      167       84      103
75-79           326      811      248      137       82       87
70-74           378      914      238      165       78       81
65-69           385      957      225      131       57       70
60-64           499      989      246      146       64       57
55-59           594    1,057      178      114       38       33
50-54           611      996      161       95       21       32
45-49           650      937      143       73       24       18
40-44           660      845      107       55       24       21
35-39           638      706       88       49       27       14
30-34           636      642       80       36       13       16
25-29           511      461       45       31       12       11
20-24           380      281       27       12        8        7
15-19           231      189       10       16        4        8
10-14            54       59        1        7        4        -
5-9              44       34        1        2        1        1
0-4               3       10        1        2        -        -

Total         7,701   14,518    3,736    2,838    1,614    2,423

M            53.874   68.287   83.842   90.366   98.823  109.881
Md           50.356   65.277   81.487   89.502   98.263  110.911

* Data from Memoirs of National Academy of Sciences, Vol. XV, p. 748.
Variables: Alpha Score x Schooling. Group I, II, III: White Draft
(Native Born)* (Continued)

For men who took alpha only.

                                    College
Alpha Score       1        2        3        4        5        6

205-209           -        -        -        -        -        -
200-204           -        -        1        2        -        -
195-199           2        1        1        5        1        1
190-194           1        2        2        5        -        1
185-189           4        4        4       17        2        1
180-184           2        3        2       19        1        -
175-179           9        8        6       23        2        1
170-174           9       13       12       36        5        1
165-169          16       14       14       38        1        2
160-164          25       18       17       44        4        3
155-159          22       29       28       39        2        1
150-154          29       34       31       47        5        1
145-149          33       31       28       53        6        2
140-144          26       26       18       41        3        -
135-139          51       37       21       29        3        2
130-134          42       50       27       29        1        1
125-129          62       38       29       35        4        1
120-124          48       43       20       34        2        1
115-119          44       58       28       42        1        -
110-114          65       48       28       28        -        1
105-109          52       36       23       21        6        2
100-104          44       33       22       17        1        -
95-99            58       25       24       26        2        -
90-94            47       51       25       17        -        -
85-89            55       35       21       10        -        -
80-84            41       37       17       11        1        -
75-79            39       29        9        6        3        2
70-74            45       22       12        4        1        2
65-69            29       29       13       13        -        -
60-64            41       17        8        2        -        -
55-59            38       16       11        2        -        -
50-54            16       12        6        3        -        -
45-49            23       11        8        3        2        -
40-44            11        7        3        -        1        -
35-39            17        3        1        1        -        -
30-34             3        2        2        2        -        -
25-29             2        3        -        2        -        -
20-24             1        4        1        -        -        -
15-19             4        -        -        -        -        -
10-14             -        -        -        1        -        -
5-9               -        -        -        -        -        -
0-4               -        -        -        -        -        -

Total         1,056      829      523      707       60       26

M           105.559  112.168  118.877  136.581  134.417    140.0
Md          106.346  114.427  119.911  141.890  143.333    147.5

* Data from Memoirs of National Academy of Sciences, Vol. XV, p. 748.
4. Calculate the geometric mean for the following cost data:

            Total Expenditure for Schools in
Year        United States Relative to 1914

1914                  100
1916                  115
1918                  138
1920                  187
1922                  285

(G.M. = 153)

Taking a = 100 and \(ar^4 = 285\), compute r and construct a geometrical
series of five terms (r = 1.30). Compare with the data. Should you
conclude that the cost increased in geometrical progression during
this period?
CHAPTER VII


MEASURES  OF  DISPERSION 


1.  Introductory 


The dispersion of a series of observations is the degree of
scatter, or the extent to which the items are spread out along
the scale from some average value. It is important to have
measures of such variability for several reasons, one of them
arising from its relation to the reliability of the average.

In Fig. 25 two distributions with the same number of cases
and the same average are shown. In the case of curve (a) the
observations cluster closely around the mean, while in curve (b)
they are spread out much more along the scale. It is therefore
apparent that the average of the first distribution is more
representative of the whole series, more typical of all the
observations, and, for the same number of items, is to be regarded
as the more reliable. In comparing two or more averages it is
necessary to have some measure of their respective reliabilities,
and for this purpose a numerical representation of the dispersion
of the series is first required. Appropriate reliability formulas
for averages and other statistical quantities will be found in
Chapter XIII.

Fig. 25. Illustrating difference in dispersion for
two series with the same mean
Another need for a measure of dispersion arises when a general
measure  of  homogeneity  is  required.  In  teaching,  for  example, 
it  is  well  known  that  when  the  students  in  a  class  differ  widely 
in  previous  training  and  mental  characteristics,  instruction  be- 
comes a  very  difficult  problem.  The  measurement  of  the  abili- 
ties involved  and  the  quantitative  appraisal  of  the  variation  in 
different  groups  make  it  possible  to  approach  such  a  problem 
scientifically.  Closely  related  to  this  problem  is  the  question 
whether  or  not  uniform  instruction  tends  to  bring  a  class  up  to 
a  common  level  of  attainment,  or  brings  about  a  still  further 
differentiation  in  ability.  These  questions  can  be  answered  best 
by  making  use  of  some  measure  of  dispersion. 

Other  uses  of  group  variability  appear  in  connection  with 
problems  in  the  overlapping  of  pupil  abilities  and  in  the  stand- 
ardization of  tests,  as  a  parameter  in  higher  statistical  analysis, 
and  as  a  common  unit  of  measure  in  comparing  performances 
on  unlike  scales.  This  last  use  will  be  discussed  in  section  7. 

2.  Mean  Deviation 

One of the simplest measures of dispersion is the mean deviation
or variation of the observations about some central tendency
such as the arithmetic mean or median. The computation will
first be illustrated by a short ungrouped series. If x denotes the
variation of an observation X from the mean M, then x = X - M
and the original values and deviations for five scores may be
set down as follows:

        X             x

       21           + 4.4
       19           + 2.4
       17           + 0.4
       14           - 2.6
       12           - 4.6

   M = 16.6        14.4 = Σ|x|

It will be observed that the algebraic sum of the deviations
x is zero in the above problem. This may be generally shown
by noting that if X' = X - A, then \(\Sigma X'/N = M - A\), so that the
sum of the deviations X' vanishes when M = A. In securing
a measure of average variation it is therefore necessary to eliminate
the algebraic signs in some way. The mean deviation is
secured by adding the absolute values of the deviations (disregarding
sign) and dividing by their number, or in symbols,

\[ \text{M.D.} = \frac{\Sigma|x|}{N} \]   {Mean deviation}   (12)

In the illustrative example,

\[ \text{M.D.} = \frac{14.4}{5} = 2.88. \]

This simple process becomes lengthy if the mean and deviations
are written to several decimal places, and for this reason a
shorter method will next be introduced. The procedure is
illustrated by Fig. 26. The above five scores are represented by
the horizontal bars, and the deviations from the mean by the
hatched and dotted portions. Since the total negative deviation
is equal to the total positive deviation, the deviation for the
entire series may be obtained by determining the total negative
deviation and multiplying the result by 2. Furthermore, the
negative deviation may be found by subtracting from the sum of
the segments AC and DF the sum of the original observations
represented by AB and DE.

The complete procedure may then be described as follows:

1. Arrange the observations in order of size.

2. Compute the mean, (16.6).

3. Count the items smaller than the mean and multiply their
number by the mean, (2 x 16.6 = 33.2).

4. Subtract from this result the sum of the items smaller
than the mean, (33.2 - 26 = 7.2). This is the total negative
deviation.

5. Multiply this last result by two, (2 x 7.2 = 14.4), and
divide by N to determine M.D., (14.4/5 = 2.88).

Fig. 26. Illustrating deviations from the mean
for five scores

The work through step 4 may be checked by adding the
items larger than the mean and subtracting from this sum the
product of the mean by the number of greater items.
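
The five steps and the check can be sketched in a few lines. This is an illustrative rendering rather than a definitive implementation; the function names are invented for the purpose, and the direct method of formula (12) is included only for comparison.

```python
def mean_deviation_direct(scores):
    """Formula (12): average of the absolute deviations from the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

def mean_deviation_short(scores):
    """Short method: twice the total negative deviation, divided by N."""
    n = len(scores)
    m = sum(scores) / n                                  # step 2
    smaller = [x for x in scores if x < m]               # step 3
    negative_total = len(smaller) * m - sum(smaller)     # step 4
    return 2 * negative_total / n                        # step 5

def negative_total_check(scores):
    """Check on step 4, using the items larger than the mean."""
    m = sum(scores) / len(scores)
    larger = [x for x in scores if x > m]
    return sum(larger) - len(larger) * m

scores = [21, 19, 17, 14, 12]
md_direct = mean_deviation_direct(scores)
md_short = mean_deviation_short(scores)
check = negative_total_check(scores)
print(round(md_direct, 2), round(md_short, 2), round(check, 1))
```

Both routes give the same result because the positive and negative deviations about the mean balance exactly; the short method would not be valid about the median.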

It is evident graphically that if the same quantity is added to
or subtracted from each item the deviations remain unchanged.
This may also be shown algebraically. If A is the quantity
subtracted we may write

\[ X' = X - A, \]

so that

\[ \frac{\Sigma X'}{N} = \frac{\Sigma X}{N} - A, \qquad M' = M - A, \]

and

\[ x' = X' - M' = X - M = x. \]


This  simplification  of  the  items  is  occasionally  useful  in 
further  shortening  the  calculation.  The  whole  procedure  is 
illustrated  by  another  series  as  shown  in  Table  16. 


Table 16. Showing the Calculation of Mean Deviation
for an Ungrouped Series

   X      X' = X - 110

  123         13   \
  120         10    |
  119          9    |   5 items larger than M'
  119          9    |   Sum of items larger than M' = 48
  117          7   /
  114          4   \
  112          2    |
  112          2    |   6 items smaller than M'
  111          1    |   Sum of items smaller than M' = 9
  110          0    |
  110          0   /

        M' = 5.182

6 x 5.182 = 31.09
            -9.
            -----
            22.09
              x 2
            -----
            44.18

M.D. = 44.18/11 = 4.02

Check:      48.
5 x 5.182 = 25.91
            -----
            22.09
In determining mean deviation it is theoretically better to
take  the  deviations  from  the  median  instead  of  from  the  mean 
because,  as  can  be  readily  demonstrated,  the  total  variation  is 
less  about  the  median.  The  above  short  method  could  not  have 
been  used,  however,  if  the  median  had  been  employed,  since 
the  sum  of  the  deviations  about  this  average  is  not  zero.  Further- 
more, for  longer  series  it  makes  very  little  difference  numeri- 
cally which  average  is  selected.  The  mean  may,  therefore,  be 
used  in  ordinary  practice. 

In the case of the frequency distribution the same method
may be used as for ungrouped items, the values of the observations
being taken at the mid-points of the intervals, that is, at
class values. The work is illustrated with the following problem:


Class        f       d       fd

90-100-      1       3        3   \
80-90-       2       2        4    |   11
70-80-       4       1        4   /
60-70-       5       0        0
50-60-       4      -1       -4   \
40-50-       3      -2       -6    |
30-40-       -      -3        -    |  -15
20-30-       -      -4        -    |
10-20-       1      -5       -5   /

N = 20              Σfd = -4

8 x 63 =  504
          370
          ---
          134
          x 2
          ---
          268

Check:    890
12 x 63 = 756
          ---
          134
The class values in this example are 15, 25, 35, etc., so that
the mean is \(65 - \frac{4}{20} \times 10\), or 63, by formula (6). There are 8
frequencies below 63, and 12 above, since the 5 in the interval
60-70- comes at 65. The product of 8 and 63 gives a result
equal to the sum of the items smaller than the mean plus their
deviations from the mean. Next, the sum of the products of
the smaller class values by their corresponding frequencies is
55 x 4 + 45 x 3 + 15 x 1 = 370. Subtracting this last result from
504 furnishes the total negative deviation 134. Multiplying this
result by 2 to obtain the total positive and negative deviation,
and dividing by 20 gives M.D. = 13.4.
By making use of certain abbreviations, a formula for mean
deviation may now be set up. Let

\(A_m\) = the class value of the interval in which M lies,
\(N_a\) and \(N_b\) = the number of observations above and below M,
\(T_a\) and \(T_b\) = the sums of the observations above and below M,
\(\Sigma|fd|_a\) and \(\Sigma|fd|_b\) = absolute values of the parts of \(\Sigma fd\)
        above and below \(A_m\), and
h = the width of the class interval.

The steps in the calculation on page 105 may now be combined
so as to give the checking formula

\[ \text{M.D.} = \frac{2(T_a - N_a M)}{N} = \frac{2(N_b M - T_b)}{N} \]

or

\[ \text{M.D.} = \frac{T_a - T_b - M(N_a - N_b)}{N}. \]   (13)

It then remains to find \(T_a\) and \(T_b\) for the frequency
distribution. These are clearly given by

\[ T_a = N_a A_m + (\Sigma|fd|_a)h \]

and

\[ T_b = N_b A_m - (\Sigma|fd|_b)h. \]

Substituting these values in equation (13) and noting that

\[ \Sigma|fd|_a + \Sigma|fd|_b = \Sigma|fd|, \]

we have

\[ \text{M.D.} = \frac{(\Sigma|fd|)h + (A_m - M)(N_a - N_b)}{N} \]   {Mean deviation for distribution}   (14)

which is the desired result.

Applying formula (14) to the problem on page 105, we find that

\[ \text{M.D.} = \frac{26 \times 10 + (65 - 63)(12 - 8)}{20} = \frac{268}{20} = 13.4, \text{ as before.} \]
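
Formula (14) can be compared with the direct computation in a short sketch. The frequencies used below (1, 2, 4, 5, 4, 3, 0, 0, 1 for the classes 90-100- down to 10-20-) are an assumption that agrees with the totals quoted in the text; the function names are illustrative, and choosing the class value nearest the mean is merely one convenient way of locating the interval in which M lies.

```python
def mean_deviation_formula_14(class_values, freqs, h):
    """Formula (14), taking the origin A_m in the interval containing M."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(class_values, freqs)) / n
    # A_m: the class value nearest the mean, i.e. the interval in which M lies.
    a_m = min(class_values, key=lambda v: abs(v - mean))
    d = [round((v - a_m) / h) for v in class_values]   # deviations in class units
    sum_abs_fd = sum(f * abs(di) for f, di in zip(freqs, d))
    n_a = sum(f for v, f in zip(class_values, freqs) if v > mean)
    n_b = sum(f for v, f in zip(class_values, freqs) if v < mean)
    return (sum_abs_fd * h + (a_m - mean) * (n_a - n_b)) / n

def mean_deviation_direct(class_values, freqs):
    """Average absolute deviation of the class values from the mean."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(class_values, freqs)) / n
    return sum(f * abs(v - mean) for v, f in zip(class_values, freqs)) / n

class_values = [95, 85, 75, 65, 55, 45, 35, 25, 15]
freqs = [1, 2, 4, 5, 4, 3, 0, 0, 1]      # assumed, consistent with the text

md_14 = mean_deviation_formula_14(class_values, freqs, 10)
md_direct = mean_deviation_direct(class_values, freqs)
print(round(md_14, 2), round(md_direct, 2))
```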
In order to fix the method of calculation and to warn the
student  of  the  difficulty  which  arises  when  A  is  not  taken  in  the 
interval  in  which  M  lies,  another  model  problem  is  next  given 
with  complete  computations. 


Table 17. Illustrating the Correct and Incorrect Calculation
of Mean Deviation for a Distribution using Formula (14)

                      Correct Method        Incorrect Method
Class Value     f       d       fd            d'       fd'

   97.5        22       3       66            5       110
   92.5        68       2      136            4       272
   87.5        51       1       51            3       153
   82.5        28       0        -            2        56
   77.5        47      -1      -47            1        47
   72.5        33      -2      -66            0         0
   67.5        21      -3      -63           -1       -21
   62.5         9      -4      -36           -2       -18
   57.5         6      -5      -30           -3       -18
   52.5         2      -6      -12           -4        -8
   47.5         1      -7       -7           -5        -5
   42.5         1      -8       -8           -6        -6

Na = 169 (class values 82.5 to 97.5)    Nb = 120 (class values 42.5 to 77.5)
Positive fd  = 253     Negative fd  = -269
Positive fd' = 638     Negative fd' = -76

N = 289               Σfd = -16             Σfd' = 562
Na - Nb = 49          Σ|fd| = 522           Σ|fd'| = 714

Correct method:

    M = 82.5 - (16 x 5)/289 = 82.223
    Am - M = .277
    M.D. = [522 x 5 + (.277)49]/289 = 9.078

Incorrect method:

    M.D. ≠ [714 x 5 - 9.723 x 49]/289 = 10.70

The work on the left is correct, while that on the right with
origin at 72.5 is quite wrong. In case it is found that A does not
lie in the interval containing the mean, this should be adjusted
at once, using the previous results as a check on the mean. The
reason for the incorrectness of the method on the right may be
shown by noting that the expressions for Ta and Tb on page
106 give incorrect results in this case. The complete proof is
left as an exercise for the student.
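
The warning can be illustrated numerically. The sketch below evaluates the right-hand side of formula (14) with an arbitrary origin, once at 82.5 (the interval containing the mean) and once at 72.5, and compares both with the directly computed mean deviation. The function names and the `origin` parameter are hypothetical and serve only this illustration.

```python
def formula_14(class_values, freqs, h, origin):
    """Right-hand side of formula (14) with an arbitrary origin A; it equals
    the mean deviation only when A is the class value of the interval
    that contains the mean."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(class_values, freqs)) / n
    sum_abs_fd = sum(f * abs(round((v - origin) / h))
                     for v, f in zip(class_values, freqs))
    n_a = sum(f for v, f in zip(class_values, freqs) if v > mean)
    n_b = sum(f for v, f in zip(class_values, freqs) if v < mean)
    return (sum_abs_fd * h + (origin - mean) * (n_a - n_b)) / n

def mean_deviation_direct(class_values, freqs):
    """Average absolute deviation of the class values from the mean."""
    n = sum(freqs)
    mean = sum(f * v for v, f in zip(class_values, freqs)) / n
    return sum(f * abs(v - mean) for v, f in zip(class_values, freqs)) / n

# Table 17, top class first.
class_values = [97.5, 92.5, 87.5, 82.5, 77.5, 72.5,
                67.5, 62.5, 57.5, 52.5, 47.5, 42.5]
freqs = [22, 68, 51, 28, 47, 33, 21, 9, 6, 2, 1, 1]

correct = formula_14(class_values, freqs, 5, origin=82.5)
wrong = formula_14(class_values, freqs, 5, origin=72.5)
direct = mean_deviation_direct(class_values, freqs)
print(round(correct, 3), round(wrong, 2), round(direct, 3))
```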
3. The Standard Deviation

In  order  to  introduce  the  next  measure  of  dispersion  we  may 
return  to  the  short  series  shown  at  the  beginning  of  the  preced- 
ing section.  A  measure  of  average  deviation  was  there  found 
by  adding  the  deviates  from  the  mean  regardless  of  sign.  By 
the  present  method  the  algebraic  signs  are  eliminated  by  squar- 
ing the  deviations  from  the  mean. 


    X          x          x²

   21        + 4.4      19.36
   19        + 2.4       5.76
   17        + 0.4        .16
   14        - 2.6       6.76
   12        - 4.6      21.16

M = 16.6           Σx² = 53.20

\[ \text{S.D.} = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{\frac{53.2}{5}} = \sqrt{10.64} = 3.26 \]

The quantity \(\Sigma x^2/N\) might now be used as a measure of mean
square dispersion, but it has been found much more convenient
and theoretically desirable to take the square root of this
average. The standard deviation is therefore defined as

\[ \text{S.D.} = \sqrt{\frac{\Sigma x^2}{N}} \]   {Standard deviation, original form}   (15)



The  method  of  calculation  for  ungrouped  series  is  com- 
paratively simple,  but  in  order  to  obviate  the  squaring  of  deci- 
mals a  short  cut  is  usually  employed. 

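It has been shown in section 2 that

\[ x = X - M = x' = X' - M'. \]

Therefore

\[ x^2 = (X')^2 - 2X'M' + (M')^2 \]

and

\[ \Sigma x^2 = \Sigma(X')^2 - 2M'(\Sigma X') + N(M')^2. \]

But since \(NM' = \Sigma X'\), we may write

\[ \frac{\Sigma x^2}{N} = \frac{\Sigma(X')^2}{N} - (M')^2, \]

or

\[ \text{S.D.} = \sqrt{\frac{\Sigma(X')^2}{N} - (M')^2}. \]   {Standard deviation for reduced series}   (16)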
Applying this formula to the above problem we have

    X      X' = X - 12     (X')²

   21           9            81
   19           7            49
   17           5            25
   14           2             4
   12           0             0

M' = 4.6               Σ(X')² = 159

\[ \text{S.D.} = \sqrt{\frac{159}{5} - 21.16} = \sqrt{10.64} = 3.26, \]

as before.
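
Formulas (15) and (16) can be placed side by side in a brief sketch. The function names are illustrative only; the constant subtracted is 12, as in the worked example.

```python
import math

def sd_original(scores):
    """Formula (15): root of the mean squared deviation from the mean."""
    n = len(scores)
    m = sum(scores) / n
    return math.sqrt(sum((x - m) ** 2 for x in scores) / n)

def sd_reduced(scores, a):
    """Formula (16): subtract a constant A first, then correct by (M')**2."""
    n = len(scores)
    reduced = [x - a for x in scores]
    m_prime = sum(reduced) / n
    return math.sqrt(sum(v * v for v in reduced) / n - m_prime ** 2)

scores = [21, 19, 17, 14, 12]
sd_15 = sd_original(scores)
sd_16 = sd_reduced(scores, 12)
print(round(sd_15, 2), round(sd_16, 2))
```

The reduced form avoids squaring decimal deviations: only the whole numbers 9, 7, 5, 2, and 0 are squared, and the single decimal quantity \((M')^2\) is subtracted at the end.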

For the frequency distribution the same method is employed.
Since X' = dh and M' = (Σfd)h/N (see Chapter VI, section 2)
the formula becomes

\[ \text{S.D.} = \sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2} \times h \]   {Standard deviation for distribution}   (17)

the calculation being carried through to the last step in class
units when the result is then multiplied by the width of the
class interval h.

The work will be illustrated by the Otis test data from Table
13. It is necessary to calculate only one column of items in
addition to the computation for the mean. The quantities
fd² are obtained by multiplying each value of d by the
corresponding fd products. These may be checked by multiplying
f by d². Thus, 5 x 30 = 150, 4 x 8 = 32, etc., or 6 x 25 = 150,
2 x 16 = 32, etc.


Table 18. Illustrating the Computation of Standard Deviation
for a Distribution with Check

Class Interval    f      d     fd     fd²      d'    fd'   f(d')²

69.5-74.5         6      5     30     150       4     24     96
64.5-69.5         2      4      8      32       3      6     18
59.5-64.5         3      3      9      27       2      6     12
54.5-59.5         6      2     12      24       1      6      6
49.5-54.5        10      1     10      10       0      -      -
44.5-49.5        23      0      -       -      -1    -23     23
39.5-44.5         8     -1     -8       8      -2    -16     32
34.5-39.5         4     -2     -8      16      -3    -12     36
29.5-34.5         4     -3    -12      36      -4    -16     64
24.5-29.5         1     -4     -4      16      -5     -5     25

                 67            37     319            -30    312
                 N            Σfd    Σfd²           Σfd'  Σf(d')²

\[ \sigma^{*} = \left[\sqrt{\tfrac{319}{67} - \left(\tfrac{37}{67}\right)^2}\,\right] \times 5 = \left[\sqrt{4.7612 - .3049}\,\right] \times 5 = 2.11 \times 5 = 10.55. \]

Check:

\[ \sigma = \left[\sqrt{\tfrac{312}{67} - \left(\tfrac{-30}{67}\right)^2}\,\right] \times 5 = \left[\sqrt{4.6567 - .2005}\,\right] \times 5 = 2.11 \times 5 = 10.55. \]

* The standard deviation is frequently symbolized by the small Greek letter σ.

A more complete check may be made by choosing a new origin
as in the calculation for the mean. If

\[ d = d' \pm 1, \]

\[ d^2 = (d')^2 \pm 2d' + 1, \]

and

\[ \Sigma fd^2 = \Sigma f(d')^2 \pm 2\,\Sigma fd' + N. \]   {Check on standard deviation}   (18)

In the above problem d = d' + 1, so that Σfd² should equal
Σf(d')² + 2Σfd' + N. Since 319 = 312 + 2(-30) + 67, the work
is checked to this stage in the calculation. The remainder of
the computation consists in substituting the appropriate values
in formula (17) as shown in the work under the model problem
in Table 18. It will be noted that it is desirable to carry the
work under the radical to four decimal places if the answer be
required to two.
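
The computation of Table 18 and the check of formula (18) may be sketched as follows. This is only an illustration under the assumption that the frequencies and deviations are those shown in Table 18; the function name `sd_grouped` and its return layout are invented for the example.

```python
import math

def sd_grouped(freqs, d, h):
    """Formula (17): S.D. from deviations d in class units, class width h.
    Returns the S.D. together with the sums of fd and fd squared."""
    n = sum(freqs)
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))
    return h * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2), sum_fd, sum_fd2

freqs = [6, 2, 3, 6, 10, 23, 8, 4, 4, 1]       # Table 18, top class first
d = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]         # origin at 44.5-49.5
d_prime = [di - 1 for di in d]                 # origin moved up one class

sigma, sum_fd, sum_fd2 = sd_grouped(freqs, d, 5)
sigma_check, sum_fd_p, sum_fd2_p = sd_grouped(freqs, d_prime, 5)
n = sum(freqs)

# Formula (18) with d = d' + 1.
identity_holds = sum_fd2 == sum_fd2_p + 2 * sum_fd_p + n

print(round(sigma, 2), round(sigma_check, 2), identity_holds)
```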

Before  comparing  the  above  two  measures  of  dispersion  and 
noting  their  uses,  another  measure  of  variability  will  be  intro- 
duced. This  is  known  as  the  semi-inter-quartile  range,  or  more 
briefly,  as  the  quartile  deviation. 

4.  The  Quartile  Deviation 

This measure of variability is defined as half the range of
the middle 50 per cent of the observations when arranged in
order of size or in a frequency distribution. It is only necessary
to determine two values, Q1 and Q3, below and above which one
quarter of the measures lie. The range Q1 to Q3 then includes
the middle half of the observations and the semi-inter-quartile
range is defined by the expression

\[ Q = \frac{Q_3 - Q_1}{2}. \]   {Quartile deviation}   (19)
In the case of ungrouped material the work may often be done
by inspection as shown in the accompanying table of total state
and local per capita expenditures in southern states. Maryland
is the median state with an expenditure of $6.11 per capita.

Table 19. Per Capita Expenditures for Education in Seventeen
Southern States

                           Per Capita Expenditure for
State                      Education in 1900

Oklahoma                        $11.94
District of Columbia             10.68
Delaware                          9.02
West Virginia                     8.75
                                            Q3 = $8.58
Texas                             8.41
Florida                           7.72
Louisiana                         6.65
Virginia                          6.61
Maryland                          6.11      Q2 = Md = $6.11
North Carolina                    5.44
Tennessee                         4.96
South Carolina                    4.63
Arkansas                          4.62
                                            Q1 = $4.59
Alabama                           4.55
Georgia                           4.55
Mississippi                       4.54
Kentucky                          4.36

Q = ($8.58 - $4.59)/2 = $3.99/2 = $2.00.



The value for Q3 is taken halfway between the expenditures
for Texas and West Virginia, or at $8.58, and similarly for Q1,
which is $4.59. Q is then half the difference between these two
results, or $2.00. It will be noted that nine cases lie between Q1
and Q3, and that this is more than half of the total number
of items, which is seventeen. For so few items, however, it is
hardly worth while to strive for a more accurate result, the
purpose of the table being to furnish only rough comparisons.

The differences Q3 - Q2 = $2.47 and Q2 - Q1 = $1.52 are not
equal to Q, because of the lack of symmetry in the series, but
their sum is of course equal to 2 Q. With this limitation in mind
the value of Q may be said to furnish approximately the
magnitude which, when laid off on both sides of the median, will
include the middle half of the items.
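
The inspection method for Table 19 amounts to a few lines of arithmetic. The sketch below assumes the seventeen expenditures as listed in the table and follows the text in placing each quartile halfway between the fourth and fifth items from its end of the series; the variable names are illustrative.

```python
# Per capita expenditures from Table 19, largest first (seventeen states).
spend = [11.94, 10.68, 9.02, 8.75, 8.41, 7.72, 6.65, 6.61, 6.11,
         5.44, 4.96, 4.63, 4.62, 4.55, 4.55, 4.54, 4.36]

ordered = sorted(spend, reverse=True)
median = ordered[8]                       # ninth of seventeen items
q3 = (ordered[3] + ordered[4]) / 2        # halfway between 4th and 5th from the top
q1 = (ordered[12] + ordered[13]) / 2      # halfway between 4th and 5th from the bottom
q = (q3 - q1) / 2

# Number of cases lying strictly between the two quartiles.
cases_between = sum(1 for v in ordered if q1 < v < q3)

print(median, round(q3, 2), round(q1, 3), round(q, 2), cases_between)
```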

When the data are in a distribution, the values for Q1 and Q3
are  computed  in  the  same  way  as  the  median,  the  only  difference 
being  that  one  quarter  instead  of  one  half  of  the  observations 
are  counted  in  from  either  end.  The  procedure  may  be  illus- 
trated for  the  following  distribution  of  intelligence  quotients. 
These  data  are  taken  from  a  survey  made  in  several  counties  in 
Illinois,  the  results  of  the  study  being  as  yet  unpublished. 


Table 20. Illustrating the Computation of Quartile Deviation
for a Distribution

I.Q.           f

150-160-       2   \
140-150-      12    |
130-140-      36    |   471
120-130-     103    |
110-120-     318   /
100-110-     799   = f3
90-100-     1074
80-90-      1059
70-80-       868   = f1
60-70-       366   \
50-60-       163    |   563
40-50-        25    |
30-40-         9   /

            4834

Q3 = 110 - (1208.5 - 471)/799 x 10 = 100.770

Q1 =  70 + (1208.5 - 563)/868 x 10 =  77.437

Q3 - Q1 = 23.333

∴ Q = 11.67

Q3 and Q1 may be computed most readily from formulas similar
to those used for the median. If one quarter of the cases
be counted in from either end of the distribution the formulas
become

\[ Q_3 = \text{u.l.} - \frac{N/4 - \Sigma f_{do}}{f_3} \times h \]   (20 a)

and

\[ Q_1 = \text{l.l.} + \frac{N/4 - \Sigma f_{up}}{f_1} \times h, \]   (20 b)

{Quartiles for distribution}
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where  /i  and  /s  are  the  frequencies  of  the  intervals  where  Qi  and 
Q3  occur  and  the  other  symbols  are  used  as  in  the  formulas  for 
the  median.  The  calculation  is  shown  in  full  at  the  right  of  the 
distribution.  A  check  may  be  made  by  counting  in  three  quar- 
ters of  the  way  from  either  end  of  the  distribution  and  using 
similar  formulas. 
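The quartile computation for Table 20 can be sketched in a few lines of Python. This is only an illustration of formulas (20 a) and (20 b), not part of the original text; the function names `count_up` and `count_down` are hypothetical, and the class limits and frequencies are those of Table 20.

```python
# Table 20: lower class limits (ascending) and frequencies; interval size h = 10.
lower_limits = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
freqs = [9, 25, 163, 366, 868, 1059, 1074, 799, 318, 103, 36, 12, 2]
h = 10


def count_up(fraction):
    """Value below which `fraction` of the cases lie (l.l. + ..., as in 20 b)."""
    target = fraction * sum(freqs)
    f_up = 0  # frequency below the current interval
    for ll, f in zip(lower_limits, freqs):
        if f > 0 and f_up + f >= target:
            return ll + (target - f_up) / f * h
        f_up += f
    raise ValueError("fraction must lie between 0 and 1")


def count_down(fraction):
    """Value above which `fraction` of the cases lie (u.l. - ..., as in 20 a)."""
    target = fraction * sum(freqs)
    f_do = 0  # frequency above the current interval
    for ll, f in zip(reversed(lower_limits), reversed(freqs)):
        if f > 0 and f_do + f >= target:
            return (ll + h) - (target - f_do) / f * h
        f_do += f
    raise ValueError("fraction must lie between 0 and 1")


q1 = count_up(0.25)      # one quarter counted in from the low end
q3 = count_down(0.25)    # one quarter counted in from the high end
q = (q3 - q1) / 2        # quartile deviation
q3_check = count_up(0.75)  # the check: three quarters counted in from the low end

print(round(q1, 3), round(q3, 3), round(q, 2))
```

Counting three quarters of the way up gives the same third quartile as counting one quarter down, which is the check suggested in the text.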

5.  Comparison  of  Measures  of  Dispersion 

In  order  to  bring  together  the  quantitative  methods  discussed 
thus  far,  all  the  simple  averages  and  measures  of  dispersion 
have  been  computed  for  the  above  distribution  and  located 
graphically  on  a  histogram.  The  student  should  work  out  and 
verify  the  following  results : 

Mean  =  89.28  M.  D.  =  13.65 

Median  =  89.31  S.  D.  =  16.86 

Crude  mode  =  95.00  Q  =  11.67 

The close agreement of the mean and median would seem to indicate a high degree of symmetry in the distribution, but contrary to expectation the data do not furnish a good example of a normal probability curve, as will be shown in Chapter XIII.

As illustrated by Fig. 27, a range of 2 Q includes the middle 50 per cent of the observations, a range of 2 M.D. (from the mean) somewhat more than half of the cases, while a range of 2 S.D. includes about two thirds of the items. Furthermore, the ratio of Q to S.D. is approximately .69, while the ratio of M.D. to S.D. is .81. These are typical of the results found with fairly symmetrical distributions. For the normal probability curve these two ratios are .6745 and .7979 respectively (Chapter XII).
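The student who wishes to verify these results mechanically may find the following Python sketch useful. It assumes that every case lies at the midpoint of its interval in Table 20 and uses no grouping corrections; the value of Q is taken from the computation shown with Table 20.

```python
from math import sqrt

# Table 20, with each case assumed to lie at the midpoint of its interval.
midpoints = [35, 45, 55, 65, 75, 85, 95, 105, 115, 125, 135, 145, 155]
freqs = [9, 25, 163, 366, 868, 1059, 1074, 799, 318, 103, 36, 12, 2]

n = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
# Mean deviation: average absolute deviation from the mean.
mean_dev = sum(f * abs(x - mean) for x, f in zip(midpoints, freqs)) / n
# Standard deviation: root mean square deviation from the mean.
std_dev = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n)

q = 11.67  # quartile deviation from Table 20
ratio_q = q / std_dev          # compare with .6745 for the normal curve
ratio_md = mean_dev / std_dev  # compare with .7979 for the normal curve

print(round(mean, 2), round(mean_dev, 2), round(std_dev, 2))
print(round(ratio_q, 2), round(ratio_md, 2))
```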

By laying off the standard deviation three times to the left and to the right of the mean, a range of 6 S.D. from 38.70 to 139.86 is obtained. By referring to Table 20 for the frequencies it will be noted that about 4811 cases, or 99 per cent of all the observations, lie within this range. For distributions of this type, then, deviations greater than 3 S.D. from the mean occur very infrequently. Similarly, a range of 7½ M.D. will extend from 38.09 to 140.47, while a range of 9 Q (laid off from the median) runs from 36.80 to 141.82. Within all three of the above ranges, therefore, more than 99 per cent of the cases will ordinarily occur.

[Fig. 27. Illustrating the comparative magnitude of several measures of dispersion. Histogram of the I.Q. distribution of Table 20 (horizontal axis: I.Q.; vertical axis: frequency), with the mean, median, and mode marked and ranges of 2 Q, 2 M.D., and 2 S.D. laid off.]


As regards clear definition there is little choice between the three measures of dispersion when the data are arranged in a frequency distribution. For undistributed series, however, the quartile deviation has the same defects as the median. As illustrated in Table 19, it is sometimes necessary to take the average of two neighboring values for $Q_1$ or $Q_3$.

The algebraic properties of the standard deviation make it the most useful for combining the results from several series, and in connection with other statistical formulas. Thus, if two series of size $N_1$ and $N_2$ have a total population of $N = N_1 + N_2$, with means $M_1$ and $M_2$, and standard deviations $\sigma_1$ and $\sigma_2$, it is possible from these values to find the mean M, and the standard deviation σ, of the whole group.

It has already been shown in Chapter VI that

$$M = \frac{N_1 M_1 + N_2 M_2}{N}.$$

The standard deviation of the total series may also be found. From the proof in section 3 it is apparent that if the assumed mean A be taken equal to M for both series,

$$M_1 - M = c_1 \quad \text{and} \quad M_2 - M = c_2.$$

The mean square variations of the component series about M are, by equation (16), $\frac{\Sigma (X'_1)^2}{N_1} = \sigma_1^2 + c_1^2$ and $\frac{\Sigma (X'_2)^2}{N_2} = \sigma_2^2 + c_2^2$ respectively. The total square variation, or $\Sigma x^2$, of both groups about M is therefore

$$\Sigma (X'_1)^2 + \Sigma (X'_2)^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2),$$

or

$$N\sigma^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2), \tag{21}$$

and in case both the means and samples are the same size we have

$$\sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2).$$
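Equation (21) may be checked numerically. The sketch below is an illustration only: the helper name `combine` and the two short series are hypothetical, chosen merely to show that the combined values agree with those computed directly from the pooled observations.

```python
from math import sqrt
from statistics import mean, pstdev


def combine(n1, m1, s1, n2, m2, s2):
    """Mean and standard deviation of two series pooled, by equation (21)."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    c1, c2 = m1 - m, m2 - m  # distances of the component means from M
    s = sqrt((n1 * (s1 ** 2 + c1 ** 2) + n2 * (s2 ** 2 + c2 ** 2)) / n)
    return m, s


# Two hypothetical series, used only for the check.
series_a = [2.0, 4.0, 4.0, 6.0]
series_b = [10.0, 12.0, 17.0]

combined_mean, combined_sd = combine(
    len(series_a), mean(series_a), pstdev(series_a),
    len(series_b), mean(series_b), pstdev(series_b),
)

# Direct computation from the pooled observations, for comparison.
direct_mean = mean(series_a + series_b)
direct_sd = pstdev(series_a + series_b)

print(combined_mean, combined_sd)
print(direct_mean, direct_sd)
```

Here `pstdev` is the root mean square deviation with divisor N, which is the standard deviation as defined in this chapter.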

The quartile deviation is probably the easiest measure of variability to compute, the mean deviation next, and the standard deviation most laborious of all. Simplicity of calculation, however, should rarely determine which measure of dispersion to employ since other properties are much more important.

The standard deviation is, in general, less affected by fluctuations in sampling than Q or M.D., and for this reason alone is preferable to the others. It is sometimes argued that the presence of a few extremely large or small observations may affect the standard deviation unduly, but if such items are truly a part of the distribution this objection is overruled.

In view of all of the above properties the standard deviation is the best measure of variability to employ for the fairly symmetrical distributions ordinarily found with educational or psychological data. A fairly safe rule with such material is to use the mean and the standard deviation whenever the data warrant careful treatment, reserving the median and Q for rough work with small samples.

6.  The  Coefficient  of  Variation 

The  measures  of  variability  discussed  thus  far  have  two 
properties  that  are  at  once  apparent. 

1.  They  are  expressed  in  the  units  of  the  variable  so  that 
direct  comparisons  of  dispersion  can  be  made  only  between 
series  on  the  same  scale. 

2.  They  depend  upon  the  size  of  the  deviations  from  some 
central  tendency,  but  are  quite  independent  of  the  magnitude 
of  the  average  itself. 

A measure of variability which is independent of the scale units and which takes into account the size of the deviations relative to the mean may be expressed in the form $\sqrt{\frac{\Sigma (x/M)^2}{N}}$, which reduces at once to $\frac{\sigma}{M}$. Professor Pearson has called this quantity (when multiplied by 100 for convenience) the coefficient of variation, or percentage ratio of the standard deviation to the arithmetic mean. Denoting this new measure of variation by V, we have

$$V = \frac{100\,\sigma}{M}. \qquad \text{(Coefficient of variation)} \tag{22}$$

It should be noted that while σ is the standard deviation of X, V is the standard deviation of 100 X/M. The student who has difficulty in visualizing the significance of the coefficient of variation may thus regard it as the dispersion found when all of the observations (or deviations from the mean) have been made comparable by dividing each by M/100.

Direct comparisons of measures of absolute variability such as standard deviation, and relative variability as given by the coefficient of variation, often lead to confusion. Both are root mean square measures of variability, but of quite different things as shown above.

A simple example may illustrate this point:

$M_1 = 20$ problems, $\sigma_1 = 4$ problems, $\therefore V_1 = 20$
$M_2 = 40$ problems, $\sigma_2 = 4$ problems, $\therefore V_2 = 10$

These two series are equally variable as to absolute dispersion, but the relative variability in the first group is twice that in the second. Both measures are entirely correct, although it has been argued by Franzen * that the coefficient of variation should not be used with such material because of the arbitrary nature of the zero point on educational tests and scales. This amounts to objecting to the coefficient of variation because the size of the mean is arbitrary, but on the same grounds we should object to the use of the mean itself.
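Formula (22) and the remark that V is the standard deviation of 100 X/M can both be illustrated briefly in Python. The function name and the three-item series below are hypothetical and serve only as an example.

```python
from statistics import mean, pstdev


def coefficient_of_variation(m, sd):
    """V = 100 * sigma / M, as in formula (22)."""
    return 100 * sd / m


# The two groups of the example: equal sigma, different means.
v1 = coefficient_of_variation(20, 4)
v2 = coefficient_of_variation(40, 4)

# A small hypothetical series: V equals the S.D. of the scores
# after each has been divided by M/100.
scores = [16.0, 20.0, 24.0]
m = mean(scores)
v_direct = coefficient_of_variation(m, pstdev(scores))
v_rescaled = pstdev([100 * x / m for x in scores])

print(v1, v2)
print(v_direct, v_rescaled)
```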

The  chief  use  of  the  coefficient  of  variation  is  in  comparing  the 
dispersion  of  series  where  the  means  differ  considerably  in  size  and 
where  the  variation  relative  to  the  mean  is  therefore  important. 

The following distributions (Tables 21 and 22) give the per capita state and local expenditures of forty-nine states (including the District of Columbia) for elementary and secondary education, and for all purposes in 1920.

* Raymond Franzen, "Statistical Issues," Journal of Educational Psychology, September, 1924, p. 381.

Table 21. Per Capita Expenditure for Education in 49 States *

  Expenditure        f
  $21-$22.99
  $19-$20.99
  $17-$18.99
  $15-$16.99
  $13-$14.99
  $11-$12.99
  $9-$10.99
  $7-$8.99
  $5-$6.99
  $3-$4.99
  Total             49

  M = $10.94
  σ = $4.87
  V = 44.5

Table 22. Per Capita Expenditure for All Purposes in 49 States *

  Expenditure        f
  $100-$109.99       1
  $90-$99.99         -
  $80-$89.99         1
  $70-$79.99         1
  $60-$69.99         8
  $50-$59.99         7
  $40-$49.99         6
  $30-$39.99        12
  $20-$29.99         6
  $10-$19.99         7
  Total             49

  M = $43.16
  σ = $20.07
  V = 46.5

* Adapted from Miss Newcomer's figures in "Financial Statistics of Public Education in the United States, 1910-1920." The Macmillan Company, 1924.

If the standard deviation had been employed in comparing the variability of these two groups, it would have appeared that there is much more uniformity among the states in educational expenditure than in total expenditures. Using the coefficient of variation, however, we find little difference in relative dispersion. While both results are correct for some purposes, the latter gives the better measure of the relative extent to which these two types of expenditure have become stabilized. A variation of a dollar in the first group is comparable with a variation not of one but of about four dollars in the second series. For such problems the relative rather than the absolute dispersion should be used to show the degree of uniformity in expenditure.


7. Comparable Measurements

One of the most important uses of variability is in furnishing units for the comparison of measurements on unlike scales. Because of its algebraic nature, the standard deviation is the most useful for this purpose. The standard scores on tests $X_1, X_2, X_3, \dots$ may then be defined as the deviations from the mean divided by the respective standard deviations, or $\frac{x_1}{\sigma_1}, \frac{x_2}{\sigma_2}, \frac{x_3}{\sigma_3}, \dots$. Like the coefficient of variation these scores are clearly abstract numbers, since they result from dividing a denominate number by a quantity of the same denomination.

It will be noted that the standard score of a pupil gives his relative position in the group in terms of a number of standard deviations above or below the mean. Thus, if the raw score be 120 with M = 90 and σ = 10, the standard score will be $\frac{120 - 90}{10} = +3$. If on another test this pupil scores 18 with M = 12 and σ = 2, his standard score will again be +3. His relative position in the distributions of both tests is clearly the same as shown by the standard scores.

Being abstract numbers, standard scores on several tests may be combined by addition. The only caution that needs to be observed is that the various distributions from which the original scores are taken for comparison shall be of the same general shape. For a very skewed distribution an observation one S.D. above the mean of the series is not comparable with a measurement one S.D. above the mean of a symmetrical group.

In  order  to  illustrate  the  use  of  standard  scores  the  following 
data  resulting  from  seven  different  tests  are  presented : 


Table 23. Standard Scores of a Pupil on Several Tests

  Test     Mean     S.D.    Score of a pupil, X    x = X - M *     x/σ
   1       163      10.2          179                + 16         + 1.57
   2       119       8.1          128                +  9         + 1.11
   3        24       6.0           28                +  4         + 0.67
   4       264      39.8          312                + 48         + 1.21
   5        74       8.2           89                + 15         + 1.83
   6         7.3     2.1            6                -  1.3       - 0.62
   7       133      16.4          151                + 18         + 1.10
  Total                           893                               6.87
  Mean                            127.6                             0.98

* The deviations x = X - M are first computed and then each is divided by σ as shown in the last column.
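The arithmetic of Table 23 may be reproduced with the short Python sketch below, given here only as an illustration. Each standard score is rounded to two decimals before adding, which is assumed to be how the total of 6.87 in the table was obtained.

```python
# Table 23: (mean, S.D., pupil's score) for each of the seven tests.
tests = [
    (163, 10.2, 179),
    (119, 8.1, 128),
    (24, 6.0, 28),
    (264, 39.8, 312),
    (74, 8.2, 89),
    (7.3, 2.1, 6),
    (133, 16.4, 151),
]

# Standard score = deviation from the mean divided by the standard deviation.
standard_scores = [round((x - m) / sd, 2) for m, sd, x in tests]

raw_total = sum(x for _, _, x in tests)       # simple sum of the raw scores
composite = round(sum(standard_scores), 2)    # sum of the standard scores
average = round(composite / len(tests), 2)    # mean standard score

print(standard_scores)
print(raw_total, composite, average)
```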
If a composite score of the seven tests is desired, it would not appear correct to add the scores on the separate tests, because they are in unlike units and undue weight would be given to extreme scores. The objection that unlike quantities should not be added is not a serious one because even horses, pigs, cows, and sheep may be added together to secure the total number of farmyard animals. This amounts, of course, to broadening the unit so as to include all items in the sub-classes of the total group. The objection against the extreme weighting of some scores may be more important, for a score of 6 on one test may represent a mental effort as serious as a score of 179 on another scale. Both of the above difficulties are overcome when standard scores are used, the only trouble being the amount of arithmetic involved and the presence of positive and negative scores.

For very careful work the best method for comparing measurements and forming composites is through the use of standard scores. Test scores are far from stable, however, and great precision in their treatment is not always desirable or necessary. In many composite tests the components may be added in the unweighted form with practically as good results as by the standard score or other methods of weighting. This is illustrated by the Terman Group Intelligence Test consisting of ten parts. The simple total of all points made was found to agree (correlate) almost perfectly with the composite formed by weighting each of the separate tests and adding them. There is considerable disagreement, of course, in the case of some scores, but when fifty to one hundred cases are taken these individual differences have little effect upon the net result, especially when the number of test items is fairly large and they are not extremely uneven in weighted value.

Aside from the question of precision it may be important to represent scores in the standard form for the purpose of clearer interpretation. By a very simple formula based on standard scores it is possible to transmute the results on any number of tests so that they all have the same mean and standard deviation. This method, which is quite old,* is frequently rediscovered and appears from time to time in a slightly different form in psychological and educational journals.

Let $X_1$ and $X_2$ represent the scores on two tests expressed in any units, $M_1$ and $M_2$ the respective means, and $\sigma_1$ and $\sigma_2$ the corresponding standard deviations. We may now write

$$\frac{x_1}{\sigma_1} = \frac{x_2}{\sigma_2}, \quad \text{or} \quad x_1 = \frac{\sigma_1}{\sigma_2}\,x_2.$$

Since x = X - M, this may also be expressed in the form

$$X_1 = M_1 + \frac{\sigma_1}{\sigma_2}(X_2 - M_2). \qquad \text{(Transmutation formula for comparable scores, score form)} \tag{23}$$

This is the desired transformation which, when applied to $X_2$, makes its mean and standard deviation equal to those of $X_1$. These properties are apparent from the preceding equation. Hence by applying this formula to each item in the series we may, without affecting the relative position of any value, change the series so that it will have any mean and standard deviation desired.

As an example we may select $M_1 = 50$ and $\sigma_1 = 10$, these being convenient numbers. By the application of the above transformation to any number of tests, they may be brought into direct comparison with the one selected as standard. Thus the series of X scores shown below may be transmuted into comparable T scores † by the relation

$$T = X_1 = 50 + \frac{10}{1.155}(X - 3),$$

or

$$T = 24.02 + 8.66\,X.$$

* Galton introduced comparable measures in the form of deviations from the median divided by the semi-inter-quartile range.

† These are similar to McCall's T-Scores. See William McCall, How to Measure in Education, The Macmillan Company, 1922. See also Chapter XII, section 8, of the present text.




    X       f          T        f
    5      10        67.32     10
    4      20        58.66     20
    3      30        50.00     30
    2      20        41.34     20
    1      10        32.68     10
  Total    90                  90

$M_2 = 3$, $\sigma_2 = \frac{2}{\sqrt{3}} = 1.155$

The distribution of T scores obviously has a mean of 50 and a standard deviation of 10. By similar transformations any number of series will have these same properties, so that the scores on all tests may be brought into direct comparison. Thus a score of 50 will always represent the performance of an individual at the mean, while 30 will represent the score of a person two standard deviations below the mean, etc. If such a scaling method were adopted it should, of course, be applied only to large groups of unselected children at different ages or grades. After the T scores have been worked out for the different tests, transmutation tables should be prepared so that any X score can be easily transformed into the corresponding T score.
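A transmutation table of the kind just described may be prepared with a few lines of Python. The sketch below applies formula (23) to the X distribution of the example; the function name `transmute` is hypothetical.

```python
from math import sqrt


def transmute(x, mean_from, sd_from, mean_to=50.0, sd_to=10.0):
    """Formula (23): carry a score X into the scale with the desired M and sigma."""
    return mean_to + sd_to / sd_from * (x - mean_from)


# The X distribution of the example: scores and their frequencies.
scores = [5, 4, 3, 2, 1]
freqs = [10, 20, 30, 20, 10]

n = sum(freqs)
m2 = sum(f * x for x, f in zip(scores, freqs)) / n                   # = 3
sd2 = sqrt(sum(f * (x - m2) ** 2 for x, f in zip(scores, freqs)) / n)  # = 2/sqrt(3)

# Transmutation table: X score -> T score.
table = {x: round(transmute(x, m2, sd2), 2) for x in scores}

# The transmuted series should have mean 50 and standard deviation 10.
t_values = [transmute(x, m2, sd2) for x in scores]
t_mean = sum(f * t for t, f in zip(t_values, freqs)) / n
t_sd = sqrt(sum(f * (t - t_mean) ** 2 for t, f in zip(t_values, freqs)) / n)

print(table)
print(round(t_mean, 2), round(t_sd, 2))
```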

8.  The  Measurement  of  Skewness 

Whenever it becomes necessary to compare several distributions of varying degrees of asymmetry or skewness, some numerical measure of this property becomes desirable. Such a measure of skewness should be independent of the unit of measurement for the variable of the distribution. Thus for a distribution of heights, a representation of skewness is needed which will remain unchanged whether the measurements be made in inches or in centimeters.

One such measure may be obtained by the formula

$$Sk = \frac{(Q_3 - Md) - (Md - Q_1)}{Q} = \frac{Q_1 + Q_3 - 2\,Md}{Q}. \qquad \text{(Measure of skewness by quartiles)} \tag{24}$$

The skewness will thus be positive when the longer tail of the distribution is in the direction of the high values of the variable as shown in Fig. 28.

The lowest value given by (24) is clearly zero when the distribution is symmetrical. While a maximum value of 2 may be obtained with the formula, in actual practice results beyond the limits ± 1 are rare.

[Fig. 28. A positively skewed distribution, with $Q_1$, Md, and $Q_3$ marked on the horizontal axis.]

A better measure of skewness is given by Pearson's formula,


$$Sk = \frac{M - Mo}{\sigma}, \qquad \text{(Pearson's measure of skewness)} \tag{25}$$

which also gives positive values for distributions of the type shown in Fig. 28. Owing to the fact that the true mode, Mo, is very difficult to determine, this formula may be replaced by another expression in which an approximate value for Mo is employed. Pearson has shown that for moderately skewed distributions, the relation between mode, mean, and median is given by

$$Mo = M - 3(M - Md).$$

Substituting this value for Mo in equation (25) we find

$$Sk = \frac{3(M - Md)}{\sigma}. \qquad \text{(Approximate measure of skewness)} \tag{26}$$


As an example we may work out the degree of skewness in the distribution of I.Q.'s of Table 20, using formulas (24) and (26). Using (24),

$$Sk = \frac{(100.77 - 89.31) - (89.31 - 77.437)}{11.67} = -.035.$$

Using (26),

$$Sk = \frac{3(89.28 - 89.31)}{16.86} = -.0053.$$

For this distribution the skewness, measured by either formula, is negative and slight.
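Both measures are easily evaluated by machine. The Python sketch below is an illustration only, with hypothetical function names; it applies formulas (24) and (26) to the values already found for Table 20.

```python
def quartile_skewness(q1, md, q3):
    """Formula (24): (Q1 + Q3 - 2 Md) / Q, where Q = (Q3 - Q1) / 2."""
    q = (q3 - q1) / 2
    return (q1 + q3 - 2 * md) / q


def approximate_pearson_skewness(m, md, sd):
    """Formula (26): 3 (M - Md) / sigma."""
    return 3 * (m - md) / sd


# Values for the I.Q. distribution of Table 20.
sk_24 = quartile_skewness(q1=77.437, md=89.31, q3=100.77)
sk_26 = approximate_pearson_skewness(m=89.28, md=89.31, sd=16.86)

print(round(sk_24, 3), round(sk_26, 4))
```

A symmetrical set of quartiles, such as 10, 20, and 30, gives zero by formula (24).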
EXERCISES

1. Calculate the mean deviation and the standard deviation for the following scores (ungrouped): 166, 159, 158, 151, 150, 142, 131, 126, 118, 101. (M.D. = 17.0; S.D. = 19.7. Ans.)

2. Compute the standard deviations for the frequency distributions of the data of Exercise 3, Chapter II.

($\sigma_O = 19.9$; $\sigma_C = 10.5$; $\sigma_T = 23.5$. Ans.)

3. Calculate the mean deviation, standard deviation, and quartile deviation for each of the problems of Exercise 1, Chapter VI.

  Problem     1       2       3       4       5       6      7      8      9
  M.D.       6.85    2.79    1.40   11.43   16.63    4.61   5.25   1.33   1.43
  S.D.       8.74    3.61    1.84   15.08   21.52    5.67   7.06   1.67   1.76
  Q          5.94    2.125   1.125   7.875  11.29    3.85   4.31   1.03   1.30

Ans.

4. Calculate the coefficients of variation for the following distributions:


Monthly  Salary  in  1914 

High-School 
Science  Teachers 

High-School 
English  Teachers 

$135-139.99 

130-134.99 

125-129.99 

120-124.99 

115-119.99 

110-114.99 

1 

3 

4 

4 

2 

10 

7 

26 

8 

16 

22 

15 

15 

5 

4 

2 

3 
1 

105-109.99 

100-104.99 

95-99.99 

1 
2 

90-94.99 

8 

85-89.99 

10 

80-84.99 

30 

75-79.99 

36 

70-74.99 

31 

65-69.99 

20 

60-64.99 

8 

55-59.99 

1 

50-54.99 

1 

144 

147 

Science: M = $94.83, σ = $15.8, V = 16.7
English: M = $77.67, σ = $10.9, V = 14.0

Ans.

The V's are more nearly alike than the σ's. Explain.
5. Verify the results in the following table:

Comparison of Foreign-Born Groups for Different Numbers of Years in the United States in Terms of Theoretical Combined Scale of Intelligence (Alpha, Beta, and All Individual Examinations Combined)*

(Intervals are to be taken as 22-23 with class value 22.5 etc.)

                                  Years in United States
  Combined Scale          0-5       6-10      11-15     16-20    Over 20      Total
  22                      1.0       0.4       1.0       0.4       0.5         3.3
  21                      2.8       2.8       2.4       1.6       3.6        13.2
  20                      5.8       8.1       6.2       3.83      7.4        31.33
  19                     14.0      18.5      12.98      8.22     12.1        65.8
  18                     27.8      38.1      27.83     17.94     24.78      136.45
  17                     55.5      72.7      52.24     32.68     41.11      254.23
  16                    104.4     142.4      88.51     59.62     66.18      461.11
  15                    172.5     240.7     139.85     76.99     86.44      716.48
  14                    265.3     355.2     199.78    115.24    106.35    1,041.87
  13                    368.8     490.1     273.95    127.44    127.11    1,387.4
  12                    441.2     597.0     308.72    119.86    113.13    1,579.91
  11                    461.5     596.9     247.31     86.62     69.95    1,462.28
  10                    470.9     529.9     189.02     50.39     44.34    1,284.55
   9                    454.3     474.7     150.88     27.54     28.48    1,135.9
   8                    342.5     347.4     100.32     17.08     17.77      825.07
   7                    212.7     207.8      57.38      7.52      9.29      494.69
   6                    106.8     101.6      26.58      3.92      3.25      242.15
   5                     44.8      37.2      10.02      1.45       .86       94.33
   4                     16.4      14.5       3.74       .50       .25       35.39
   3                      4.7       4.3       1.03       -         -         10.03
   2                      1.5       1.3        .32       -         -          3.12
   1                       .4        .3        -         -         -           .7
  Total               3,575.6   4,281.9   1,900.06    758.84    762.89   11,279.29

  First quartile         9.36      9.75     10.66     11.94     12.15        9.98
  Median                11.29     11.71     12.53     13.51     13.74       12.03
  Third quartile        13.34     13.61     14.28     15.15     15.59       13.93
  Quartile deviation     1.99      1.93      1.81      1.61      1.72        1.98

6. Work out the standard scores for the first five pupils on the three intelligence tests of Exercise 1, Chapter II, using the means and standard deviations already calculated.

  Otis:       1.59    1.49    -.57     .09   -1.67
  Chicago:    -.17    2.07    -.31    -.74   -1.36
  Terman:     -.32    1.21     .28    -.83   -2.28     Ans.


*  Data  from  Memoirs  of  the  National  Academy  of  Sciences,  Vol.  XV,  p.  704. 
7. Convert the following distribution into a series having mean = 85 and S.D. = $\sqrt{30}$.

     Given Distribution          Transformed Distribution
  Class Value, X      f               T          f
       80             1             96.62        1
       70             2             92.75        2
       60             3             88.88        3
       50             8             85.00        8
       40             3             81.13        3
       30             2             77.26        2
       20             1             73.39        1
     Total           20                         20

Transformation equation is T = .3873 X + 65.64.

8.  Derive  formula  (17)  from  (16). 


CHAPTER  VIII 

THE  PERCENTILE  METHOD 

1.  Introductory 

There  is  nothing  essentially  new  in  the  method  of  percentiles, 
but  the  recent  wide  use  of  percentile  scores,  ranks,  and  curves 
in  dealing  with  test  data  warrants  a  somewhat  detailed  account 
of  this  method. 

It is hardly worth while to apply the percentile method in any form unless the data are sufficient in number to justify their representation in a frequency distribution. Percentiles are obtained in the same way as the median and quartile values which, as we have seen, are not well defined in the case of ungrouped items. Furthermore, the irregular nature of short series makes the percentile values unstable and of little practical significance. For these reasons the method will be discussed only in connection with frequency distributions.

2.  Percentiles 

A percentile is a value of the variable below which a given per cent of the frequencies lie. These values may be denoted by the symbol $P_p$, where the subscript p refers to the percentage of observations smaller than $P_p$. Thus $P_{10}$, $P_{25}$, $P_{50}$, and $P_{82}$ are values such that 10, 25, 50, and 82 per cent of the cases lie below them.

From this definition it is apparent that the median is equal to $P_{50}$ and that the quartile values $Q_1$ and $Q_3$ are equal respectively to $P_{25}$ and $P_{75}$.

Formulas  for  the  computation  of  percentile  values  may  now 
be  set  up  in  a  form  similar  to  those  used  for  the  median. 
Let p = the percentage of the cases smaller than $P_p$,
$f_p$ = the frequency of the class where $P_p$ occurs,
$f_{up}$ and $f_{do}$ = the frequency up to and down to the interval containing the required percentile,
u.l. and l.l. = the upper and lower limits of this interval, and
h and N = the size of the interval and sample as before.

The formulas then become

$$P_p = \text{l.l.} + \frac{\frac{pN}{100} - f_{up}}{f_p} \times h \qquad \text{(Percentiles, counting up)} \tag{27 a}$$

and

$$P_p = \text{u.l.} - \frac{\frac{100 - p}{100}\,N - f_{do}}{f_p} \times h. \qquad \text{(Counting down)} \tag{27 b}$$

Table 24. Illustrating the Computation of Percentiles

  Age received Ph.D.      f
        45                3
        44                -
        43                3
        42                3
        41                1
        40                5
        39                9
        38                5
        37                5
        36                7
        35                7
        34               10
        33               13
        32               17
        31               29
        30               42
        29               31
        28               27
        27               37
        26               54
        25               38    = f_p
        24               29
        23               14
        22                7
        21                2
        20                2
      Total             400

f_do = 308 (ages 26 to 45); f_up = 54 (ages 20 to 24)

$$\frac{pN}{100} = .20 \times 400 = 80$$

$$P_{20} = 24.5 + \frac{80 - 54}{38} \times 1 = 24.5 + .684 = 25.184$$

Check:

$$P_{20} = 25.5 - \frac{320 - 308}{38} \times 1 = 25.5 - .316 = 25.184$$
In order to illustrate the use of formulas (27 a) and (27 b), the complete calculation for $P_{20}$ is given in Table 24. It should be noted that the ages are given at class values, the intervals being 44.5-45.5, 43.5-44.5, etc. The check should be used until the student is confident of the accuracy of his calculations.
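The calculation of Table 24, together with its check, may be sketched in Python as follows. The function names are hypothetical; the class values and frequencies are those of the table, with class values taken at the middle of intervals of size h = 1.

```python
# Table 24: ages (class values) in ascending order, with their frequencies.
ages = list(range(20, 46))
freqs = [2, 2, 7, 14, 29, 38, 54, 37, 27, 31, 42, 29, 17,
         13, 10, 7, 7, 5, 5, 9, 5, 1, 3, 3, 0, 3]
h = 1.0
n = sum(freqs)


def percentile_up(p):
    """Formula (27 a): count up from the low end to pN/100 cases."""
    target = p * n / 100
    f_up = 0  # frequency below the current interval
    for x, f in zip(ages, freqs):
        if f > 0 and f_up + f >= target:
            return (x - h / 2) + (target - f_up) / f * h
        f_up += f
    raise ValueError("p must lie between 0 and 100")


def percentile_down(p):
    """Formula (27 b): count down from the high end to (100 - p)N/100 cases."""
    target = (100 - p) * n / 100
    f_do = 0  # frequency above the current interval
    for x, f in zip(reversed(ages), reversed(freqs)):
        if f > 0 and f_do + f >= target:
            return (x + h / 2) - (target - f_do) / f * h
        f_do += f
    raise ValueError("p must lie between 0 and 100")


p20 = percentile_up(20)
p20_check = percentile_down(20)
deciles = [round(percentile_up(p), 2) for p in range(10, 100, 10)]

print(round(p20, 3), round(p20_check, 3))
print(deciles)
```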

By similar computations, the values for $P_{10}$, $P_{20}$, up to $P_{90}$ may be obtained and set down as follows:

$P_{10} = 24.02$,   $P_{40} = 26.88$,   $P_{70} = 30.43$
$P_{20} = 25.18$,   $P_{50} = 28.13$,   $P_{80} = 31.97$
$P_{30} = 26.02$,   $P_{60} = 29.47$,   $P_{90} = 35.64$

These percentiles divide the series into ten equal parts so that a given age may be readily located in any part of the distribution. Thus if a man received his Ph.D. at twenty-six, it is at once apparent that 30 per cent of the men were younger than he when they took this degree. Similarly, a man who received the degree at thirty-two was among the oldest fifth of the entire group.

The above method for obtaining percentile values is the most direct and accurate one. The same results may be obtained more easily, however, by making use of the cumulative frequency curve as described in Chapter II. The computation in this case is graphical and the accuracy of the results will depend upon the construction and use of the drawing. When adding in from the lower end of the series, the cumulative frequency distribution for ages may be arranged as shown in Table 25.

The plot of these data is shown in the cumulative frequency curve of Fig. 29. The p scale on the right is made by dividing the total cumulative frequency scale into 100 equal parts. In order to obtain any percentile value graphically it is only necessary to find the required percentile index p on the p scale, move to the left from this point until the curve is reached, and then drop down vertically to the percentile value on the horizontal scale.
Table 25. Cumulative Frequency Distribution for Data of Table 24

   Age      Frequency Less than Given Age
   45.5               400
   44.5               397
   43.5               397
   42.5               394
   41.5               391
   40.5               390
   39.5               385
   38.5               376
   37.5               371
   36.5               366
   35.5               359
   34.5               352
   33.5               342
   32.5               329
   31.5               312
   30.5               283
   29.5               241
   28.5               210
   27.5               183
   26.5               146
   25.5                92
   24.5                54
   23.5                25
   22.5                11
   21.5                 4
   20.5                 2

Fig. 29 has been drawn in the form of a polygon, consisting
of straight lines between the cumulative frequency points.
While it is sometimes legitimate to smooth in the points by a
free-hand or fitted curve, the student had better confine himself
to the use of the polygon until he has made a special study
of the subject of smoothing.

Fig. 29. Cumulative frequency curve for ages at which Ph.D.'s
were received (horizontal scale: Age, 20.5 to 44.5)

Although greater precision may be obtained by the use of the
direct method of computing percentile values, the equivalence
of the two procedures may be readily seen. Because of the
manner in which the cumulative frequency curve is constructed
the value of the ordinate gives the total frequency below the
corresponding abscissa. Thus 54 frequencies lie below 24.5,
and 92 frequencies lie below 25.5. By joining these points with
a straight line as in Fig. 30, it is assumed that the increase in
cumulative frequencies AB is directly proportional to the part
of the interval up to the corresponding point P on the horizontal
scale. In this case the frequency at P is 20 per cent of the
observations, or 80, so that AB = 26. The value of X is therefore
found from the proportion

$$\frac{X}{1} = \frac{26}{38}, \qquad X = .68.$$

Adding this result to 24.5, the lower limit of the interval, gives
25.18, or exactly the same result as by the direct method of
calculation.
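The proportional-parts reasoning just described may be sketched in a few lines of Python. This is an illustrative sketch only and not part of the original method: the function name `percentile_value` and the list layout are assumptions, while the figures are the class limits and cumulative frequencies of Table 25.

```python
# Upper class limits and the frequencies "less than given age" (Table 25),
# listed from the lower end of the series upward.
limits = [20.5 + i for i in range(26)]          # 20.5, 21.5, ..., 45.5
cum_freq = [2, 4, 11, 25, 54, 92, 146, 183, 210, 241, 283, 312, 329,
            342, 352, 359, 366, 371, 376, 385, 390, 391, 394, 397, 397, 400]


def percentile_value(p, limits, cum_freq):
    """Return the value below which p per cent of the cases lie.

    Linear interpolation is used inside the class interval, which is the
    assumption made when the plotted points are joined by straight lines.
    """
    target = p * cum_freq[-1] / 100             # number of cases required below
    for i in range(1, len(limits)):
        lower_f, upper_f = cum_freq[i - 1], cum_freq[i]
        if lower_f <= target <= upper_f and upper_f > lower_f:
            width = limits[i] - limits[i - 1]
            return limits[i - 1] + (target - lower_f) / (upper_f - lower_f) * width
    raise ValueError("percentile index lies outside the tabulated range")


print(round(percentile_value(20, limits, cum_freq), 2))   # 25.18
```

The same function gives any other percentile of the series, for example `percentile_value(50, limits, cum_freq)` for the median.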


Fig. 30. Enlargement of a portion of the cumulative frequency curve
to illustrate the calculation of P20

3. Percentile Curves

A percentile curve may be made by plotting the series of values
such as those worked out in section 2. The ordinate is the value
of the percentile, while the abscissa is the percentile index p.
Such curves should be distinguished from cumulative frequency
curves where integrated frequency is represented by the ordinate
and values of the variable by the abscissa. Both of these are often
called percentile curves, but it is more in harmony with
mathematical convention to name a curve according to what is
represented by the ordinate, and this is the basis for the above
distinction. Inspection of Figs. 29 and 31 shows that the two
curves are essentially different in form, one being reversed in
curvature from the other.* The percentile curve has been termed
by Francis Galton an ogive.


Fig.  31.   Percentile  curve  for  Ph.D.  data 


Fig.  32.    Percentile  or  ogive  curve  for  Ph.D.  data 

Another method for constructing such ogives is by means of
the cumulative frequency distribution. The difference between
this method and the one just shown is that values of the ends
of the class intervals are now plotted over computed p-scale
values instead of finding the percentiles at given p values, 10,
20, 30, etc., and plotting them.

* Fig. 32 is the mirror image of Fig. 29 when turned through ninety degrees.

The calculation is very simple, consisting of a series of cumulative
frequency percentages for the values of p at the ends of
the intervals. In Table 26 these results are set forth for the
Ph.D. data. The computation may be done most readily by
setting the reciprocal of 400 in the calculating machine and
multiplying it into the series of cumulative frequencies. The
curve as shown in Fig. 32 is now constructed by plotting the
values from Table 26 and connecting the resulting points. These
points are indicated by dots, while those previously calculated
from the simple distribution are given for comparison by small
crosses. It is obvious that these latter values could have been
obtained graphically by making use of the ogive formed by
the dots.

Table 26. Illustrating Cumulative Frequency Percentages

    Age      fc = Frequency Less      p = Percentage Frequency Less
             than Given Age           than Given Age = (fc / N) x 100
    45.5     400                      100.0
    44.5     397                       99.3
    43.5     397                       99.3
    42.5     394                       98.5
    41.5     391                       97.8
    40.5     390                       97.5
    39.5     385                       96.3
    38.5     376                       94.0
    37.5     371                       92.8
    36.5     366                       91.5
    35.5     359                       89.8
    34.5     352                       88.0
    33.5     342                       85.5
    32.5     329                       82.3
    31.5     312                       78.0
    30.5     283                       70.8
    29.5     241                       60.3
    28.5     210                       52.5
    27.5     183                       45.8
    26.5     146                       36.5
    25.5      92                       23.0
    24.5      54                       13.5
    23.5      25                        6.3
    22.5      11                        2.8
    21.5       4                        1.0
    20.5       2                        0.5
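The construction of the ogive points from cumulative frequencies may be sketched as follows. This is a hedged illustration rather than the author's procedure: the helper name `cumulative_percentages` is an assumption, and the data are the cumulative frequencies of the Ph.D. series. The percentages are left unrounded, so that 99.25 appears where the table prints 99.3.

```python
# Upper class limits and cumulative frequencies "less than given age"
# for the Ph.D. data, from the lower end of the series upward.
limits = [20.5 + i for i in range(26)]          # 20.5, 21.5, ..., 45.5
cum_freq = [2, 4, 11, 25, 54, 92, 146, 183, 210, 241, 283, 312, 329,
            342, 352, 359, 366, 371, 376, 385, 390, 391, 394, 397, 397, 400]
N = cum_freq[-1]                                # total frequency, here 400


def cumulative_percentages(cum_freq, n):
    """p = 100 * fc / N for the end of each class interval."""
    return [100 * fc / n for fc in cum_freq]


# Each ogive point has p as abscissa and the class limit as ordinate.
ogive_points = list(zip(cumulative_percentages(cum_freq, N), limits))
for p, age in ogive_points[:5]:
    print(f"p = {p:6.2f}   age = {age}")
```

Working in percentages in this way avoids constructing a p scale for an awkward total frequency.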

4.  Use  of  Percentile  Curves 

A review of the foregoing paragraphs will show that percentile
values may be calculated in three ways. They may be computed
numerically from formulas (27 a) and (27 b), making use of the
simple frequency distribution; they may be calculated graphically
from the cumulative frequency curve; and finally they
may be obtained graphically from the ogive, or percentile curve.

The  particular  method  to  be  used  depends  upon  the  adequacy 
of  the  data,  the  number  of  percentiles  required,  and  the  accuracy 
needed.  Unless  the  data  are  fairly  plentiful  (one  hundred  or 
more  cases),  the  graphical  methods  are  usually  not  as  expedi- 
tious as  the  use  of  the  formulas.  Furthermore,  if  only  the  median 
and  quartiles  are  required,  it  does  not  pay  to  throw  the  data  into 
cumulative  or  ogive  form  in  order  to  obtain  them.  Finally,  if 
considerable  accuracy  be  needed  in  the  result,  the  numerical 
method  is  far  superior  to  the  others. 

In case a number of percentiles are required and the data are
sufficient in number, either of the above graphical methods may
be employed to advantage, where only fairly accurate results are
needed. If the total number of cases gives a convenient quotient
when divided by 100, the p scale of the cumulative curve may
be readily constructed, and this method is probably the better
to use. For most problems, however, the total frequency is an
awkward number such as 371, so that the graphical construction
of the p scale becomes difficult. It is therefore usually best
in constructing both the cumulative frequency curve and the
ogive, to use the percentage frequencies as shown in Table 26.

Another use of percentile curves is in the comparison of two
series. As an example, one of Otis's graphs is shown in Fig. 33.
The two curves shown have been smoothed free-hand, but as
already indicated the subject of smoothing is beyond a course
such as this, and the student will ordinarily do better to take
the data at their face value in drawing such percentile curves.

Fig. 33. Illustrating the method of drawing a percentile curve *

Otis † points out that the scores in Grade 5B are appreciably
higher than those of Grade 4B, but that on the whole the
distributions of scores of the two grades overlap very markedly.
He goes on to show that "a convenient way to express the
overlapping is to state the per cent of scores of Grade 4B that exceed
the median score of Grade 5B, or to state the per cent of scores
in Grade 5B that fall below the median score of Grade 4B.
Thus, by finding the point on the 4B curve having a height
representing a score of 37 (the median score of Grade 5B), we
find that the upper 17 per cent of the scores on Grade 4B, as
indicated by the curve, are above the median score of Grade 5B.
The dotted lines indicate the solution." Otis also shows that
such curves are convenient for finding and comparing various
percentile values. Thus the pupils at the 10 and 90 percentiles
in the two groups differ less widely than do the corresponding
median pupils. This is shown by the vertical distance between
the curves. There is, however, a certain amount of optical
illusion in such comparisons which makes the curves appear to
bulge apart in the middle.

* From Arthur S. Otis, Statistical Method in Educational Measurement. World
Book Company, Yonkers-on-Hudson, New York, 1925.

† A. S. Otis, Statistical Method in Educational Measurement, p. 87. World
Book Company, 1925.

5.  Percentile  Ranks 

The percentile curve is also useful in determining graphically
what are known as percentile ranks. These are the p values on
the horizontal scale for such a curve. The percentile rank of a
given score is therefore the per cent of the observations below that
score in the distribution. In obtaining such ranks from the ogive
it is only necessary to find the given score on the vertical scale,
run across horizontally until the curve is met, and then drop
down at right angles to the required p value, or percentile rank.
As an example, making use of Fig. 32, let it be required to find
the rank of a man who received his Ph.D. at 36. The result, as
shown, is a percentile rank of about 91. This means that out
of one hundred such men nine were older when they received
this degree.

It may be noted that for percentile ranks 100 is high and 1 is
low, which is contrary to the ordinary practice of assigning 1 to
the highest score in a series. Since the p values should naturally
increase with an increasing value of the variable, this reversal
seems justifiable in the case of percentile ranks, which need not
be confused with ordinary ranks if properly specified.

Fig. 34. Illustrating linear interpolation with the ogive
(p scale marked at 89.75 and 91.5)

A formula for percentile ranks may be derived, making use of
numerical rather than graphical interpolation as illustrated above.
This method will be first illustrated by an enlargement of the
portion of the ogive including age 36 as shown in Fig. 34. From
similar triangles it is apparent that

$$\frac{a}{36 - 35.5} = \frac{91.5 - 89.75}{1}, \quad \text{or} \quad a = .875.$$

The required percentile rank is therefore 89.75 + .875 = 90.625. Such
great accuracy as this is, of course, rarely necessary, and the
final result may here be written 90.6, or possibly 91, as before.

The formula for percentile ranks may now be set up by letting

    X = the value of the given score,
    R_X = its percentile rank,
    l.l. = lower limit of the interval containing X,
    R_u and R_l = the percentile ranks of the upper and lower limits
        of this interval, and
    h = width of the class interval.

We therefore have

$$R_X = R_l + \frac{R_u - R_l}{h}\,(X - \text{l.l.}) \qquad \text{(Percentile rank formula, form 1)} \qquad (28)$$

As an illustration of the use of this formula we may compute
once more the percentile rank for the age 36. From Table 26,

$$R_u = \frac{100 \times 366}{400} = 91.5, \quad \text{and} \quad R_l = \frac{100 \times 359}{400} = 89.75.$$

We therefore obtain

$$R_{36} = 89.75 + \frac{91.5 - 89.75}{1}\,(36 - 35.5) = 90.625 \approx 91.$$
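The interpolation of formula (28) may be sketched in Python as below. The function name `percentile_rank_form1` and its argument names are assumptions made for illustration; the figures are those used above for age 36.

```python
def percentile_rank_form1(x, lower_limit, r_lower, r_upper, h=1.0):
    """Formula (28): interpolate between the ranks of the interval limits.

    x            the given score
    lower_limit  lower limit of the interval containing x
    r_lower      percentile rank of the lower limit
    r_upper      percentile rank of the upper limit
    h            width of the class interval
    """
    return r_lower + (r_upper - r_lower) / h * (x - lower_limit)


# Ranks of the limits 35.5 and 36.5, from cumulative frequencies 359 and 366.
r_l = 100 * 359 / 400        # 89.75
r_u = 100 * 366 / 400        # 91.5

print(percentile_rank_form1(36, 35.5, r_l, r_u))   # 90.625
```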
A more direct and usually more convenient formula for percentile
ranks may be obtained by letting

    f_x = frequency of the interval where X occurs,
and f_up = frequency up to this interval.

We may then write

$$R_X = \frac{100\,[\,f_x (X - \text{l.l.}) + (f_{up})\,h\,]}{N h} \qquad \text{(Percentile rank formula, form 2)} \qquad (29)$$

Thus from Table 24 the frequency at age 36 is f_x = 7, while
the frequency up to this interval is 359. Substituting these
values in formula (29), we find that

$$R_{36} = \frac{100\,[\,7(36 - 35.5) + 359 \times 1\,]}{400 \times 1} = 90.625,$$

as before. The student should show that formulas (28) and
(29) are equivalent.

Instead of finding the percentile rank of X it is often sufficient
to find the percentile rank of the class value of the interval
where X occurs. In this case, formula (29) reduces easily to

$${}_cR_X = \frac{50 f_x}{N} + \frac{100\,(f_{up})}{N} = \frac{50 f_x}{N} + R_l \qquad \text{(Class value rank)} \qquad (30)$$

Applying this form to age 36, we again find cR_36 = 90.625, since
36 happens to be the class value of the interval. The rank 91
would be used for any age between 35.5 and 36.5, according to
this last approximation, and this is often sufficiently accurate
with a narrow class interval.
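Formulas (29) and (30) may be compared numerically by a short Python sketch. The function and argument names are assumptions chosen for illustration; the frequencies are those quoted for age 36 (f_x = 7, f_up = 359, N = 400, h = 1).

```python
def percentile_rank_form2(x, lower_limit, f_x, f_up, n, h=1.0):
    """Formula (29): rank from the interval frequency and the frequency below it."""
    return 100 * (f_x * (x - lower_limit) + f_up * h) / (n * h)


def class_value_rank(f_x, f_up, n):
    """Formula (30): rank of the class value (mid-point) of the interval."""
    return 50 * f_x / n + 100 * f_up / n


print(percentile_rank_form2(36, 35.5, f_x=7, f_up=359, n=400))   # 90.625
print(class_value_rank(f_x=7, f_up=359, n=400))                   # 90.625
```

The two results agree here only because 36 is the class value of its interval; for any other age within the interval, formula (30) still returns 90.625 while formula (29) does not.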

The percentile ranks of a set of scores often make a very
convenient record for administrative use. This may be illustrated
in the case of a group of graduate students who were given an
intelligence test. The gross scores were not used because of the
lack of suitable norms for such groups. By converting the
scores of the tests into percentile ranks, the relative position of
each student in the group could be seen at a glance. Thus John
Doe's graduate record might appear as follows:

    General average of college marks . . . . . . A-
    Estimated fitness for research . . . . . . . Excellent
    Personality  . . . . . . . . . . . . . . . . Pleasing
    Experience in college teaching . . . . . . . 3 years
    Age  . . . . . . . . . . . . . . . . . . . . 25
    Percentile rank in intelligence test . . . . 97

The  youth,  high  scholarship,  and  personality  of  this  man  are 
indications  of  future  success  in  college  teaching.  His  percentile 
rank  of  97  is  additional  evidence  in  this  respect,  because  in 
a  group  of  119  students,  he  was  exceeded  by  only  3  per  cent  in 
general  mental  alertness. 

Inasmuch  as  a  very  accurate  rank  for  such  purposes  is  not 
required,  the  graphical  method  of  determination  from  the  per- 
centile curve  may  be  conveniently  used.  An  error  of  1  per  cent 
in  the  percentile  rank  of  a  student  will  make  no  difference  in 
the  administrative  interpretation  of  the  test  result,  and  this 
degree  of  accuracy  may  be  easily  obtained  from  the  ordinary 
free-hand  graph. 

While  percentile  ranks  will  furnish  the  medians  and  quartiles 
of  the  original  distribution  of  scores,  the  ranks  should  not  be 
treated  like  actual  scores.  In  combining  scores  from  several 
tests,  for  example,  it  would  not  be  legitimate  to  add  the  raw 
scores  from  certain  tests  to  other  scores  expressed  in  the  form 
of  percentile  ranks.  The  distribution  of  such  ranks  will  approx- 
imate a  long  rectangle,  the  standard  deviation  of  which  is  of 
doubtful  significance.  It  is  therefore  much  better  to  keep  the 
data  in  their  original  form  for  most  purposes  and  to  convert  the 
items  into  percentile  ranks  only  for  such  uses  as  those  which 
have  been  described  above. 

In  general  the  whole  percentile  method  is  cruder  but  some- 
times more  convenient  than  methods  in  which  the  raw  scores 
are  employed  directly.  Percentile  curves  and  ranks  are  in  ex- 
tensive use  at  present,  but  for  careful  analytical  work  it  is  usu- 
ally best  to  employ  methods  based  on  the  actual  rather  than 
on  the  relative  values  of  the  scores. 
EXERCISES

1. Calculate the nine deciles by formula (27) for the distribution
of science teachers' salaries given in Exercise 4 of Chapter VII.

(P10 = $76.13; P20 = $80.93; P30 = $85.50; P40 = $88.77; P50
= $92.81; P60 = $99.63; P70 = $102.65; P80 = $106.57; P90 = $114.80.
Ans.)

2. Construct a cumulative frequency curve for the data of
Exercise 1, and check the above deciles by the graphical method.

3. Work out the nine deciles for the distribution of the
fourth-year high-school group from the table in Exercise 3, Chapter VI.

(P10 = 66.74; P20 = 81.39; P30 = 92.76; P40 = 102.23; P50
= 110.91; P60 = 119.01; P70 = 128.84; P80 = 138.98; P90 = 152.12.
Ans.)

4.  Construct  a  percentile  curve  from  the  values  obtained  in 
Exercise  3. 

5. Calculate a table of class-value percentile ranks, using
formula (30) and the data of Exercise 3. Check by the cumulative
frequency curve.

6.  Compute  by  formula  (29)  the  percentile  ranks  for  the  fol- 
lowing scores:  167,  35,  171,  81,  and  104,  using  the  distribution  of 
Exercise  3.    (96.4,  1.8,  97.5,  19.7,  42.0.    Ans.) 

7.  Prove  that  formulas  (28)  and  (29)  are  equivalent. 


CHAPTER  IX 


LINEAR  CORRELATION  WITH  QUANTITATIVE  SERIES 

1.  The  Meaning  of  Correlation 

Correlation is sometimes defined as the concomitant variation
of two traits. This definition may be illustrated by the scores
of fifty pupils on the Otis and Chicago group intelligence tests
listed in Exercise 1, Chapter II. In running through the pairs of
scores for each pupil, it will be noted that a high score on one
test is usually associated with a high score on the other, while a
low score on one test tends to be paired with a correspondingly
low score on the second test.

This relationship, or correlation, is brought out more clearly
by means of a scatter diagram, which is merely a plot of the
associated pairs of scores as shown in Fig. 35.

There is a general tendency in this diagram for the points to
form a straight band across the graph, and this furnishes
graphical evidence of linear correlation. The narrower the band and
the more closely the points cluster along a straight line, the
higher such correlation becomes.

Fig. 35. Scatter diagram of the Otis and Chicago test scores.
(The numbers identify the scores given in Exercise 1, Chapter II)




STATISTICAL  METHODS  IN  EDUCATION 


In  Fig.  36  another  scatter  diagram  is  shown,  but  the  band 
in  this  example  forms  a  distinct  curve.  The  correlation  in 
this  case  is  regarded  as  non-linear,  but  like  linear  correlation 
the  relationship  between  the  two  variables  becomes  closer  as 
the  points  form  a  narrower  and  narrower  band,  finally  approx- 
imating a  single-valued  mathematical  function. 

Fig. 36. Showing the relationship between per-pupil expenditure and
percentage of total school expenditure derived from the state. (Data
supplied by Dr. R. E. Wager) [Horizontal scale: expenditure per pupil
in dollars]

Perfect correlation is reached when all of the points in the
scatter diagram fall exactly on a curve.* Two examples of such
relationship are shown in Fig. 37, one for linear and one for
non-linear correlation. With observed data, perfect correlation
is, of course, impossible, but very close approximations are often
reached in verifying physical laws such as stress = k × strain.


*  It  will  be  remembered  that  curve  is  a  general  expression  for  the  designation  of 
both  linear  (straight  line)  and  non-linear  functions. 
In view of the above discussion we may now define correlation
as the tendency for two observed variables to be related in the form
of a single-valued mathematical function, or, more briefly, as the
tendency toward single-valued functionality. A single-valued
function is one such that for any value of the argument only
one value of the function results.

The present chapter will deal with linear correlation for
quantitative series, while Chapter X will be devoted to the
measurement of curvilinear relationship. For both types of correlation
it is possible to express the degree of association in numerical
terms, and obtain an equation for the mathematical curve
which most closely approximates the data.

Fig. 37. Illustrating two types of perfect correlation
(panels: perfect linear correlation; perfect non-linear correlation)


2.  The  Product-Moment  Correlation  Coefficient 

In Fig. 38 on page 144 it will be noted that the plane has
been divided into four quadrants by erecting perpendiculars at
the means on the two scales. Designating these quadrants in the
usual way, it appears that points located in the first and third
quadrants will tend to produce high correlation, while points
located in the other two quadrants will tend to reduce the
amount of such correlation. When the points are scattered
randomly over the plane, the correlation will approach zero.
A measure of relationship might then be obtained by considering
the products of the deviations from the means, that
is, $x = X - M_x$, and $y = Y - M_y$. For a point such as P1, the
product xy will be positive, while for P2 in the second quadrant,
the xy product will be negative, etc. The average of all such
pair-products might be used to measure correlation, were it not
for the fact that the deviations are expressed in the units of the
respective scales. In order to overcome this difficulty it is only
necessary to use standard scores, that is, $x/\sigma_x$ and $y/\sigma_y$, and
the resulting product average will then become a pure number.
Thus for N pairs of associated points the product-moment
correlation coefficient becomes

$$\left(\frac{x_1 y_1}{\sigma_x \sigma_y} + \frac{x_2 y_2}{\sigma_x \sigma_y} + \cdots + \frac{x_N y_N}{\sigma_x \sigma_y}\right) \div N.$$

Fig. 38. Illustrating product moments in four quadrants
(quadrants I to IV formed by perpendiculars at M_x and M_y;
horizontal scale X, Otis)

Representing this coefficient by the symbol r, and denoting the
sum in the usual way, we have

$$r = \frac{\Sigma xy}{N \sigma_x \sigma_y} \qquad \text{(Product-moment correlation coefficient, original form)} \qquad (31)$$

or

$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \, \Sigma y^2}} \qquad (32)$$

since

$$\sigma_x = \sqrt{\frac{\Sigma x^2}{N}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\Sigma y^2}{N}}.$$

The product-moment coefficient is thus the arithmetical mean
of the pair-products of associated standard scores. The above
formula was developed originally by Karl Pearson,* who based
his proof upon the product-moment function of Bravais, and a
method of Galton's closely related to the above standard scores.
It is therefore often called the Pearson correlation coefficient.
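The statement that the coefficient is the mean of the pair-products of standard scores may be checked by a short Python sketch. The function names and the five pairs of scores are hypothetical, chosen only so that the arithmetic is easy to follow; the sketch assumes standard deviations computed with N in the denominator, as in the text.

```python
from math import sqrt


def standard_scores(scores):
    """Deviations from the mean divided by the standard deviation (N in the divisor)."""
    n = len(scores)
    mean = sum(scores) / n
    sigma = sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [(s - mean) / sigma for s in scores]


def r_form_31(xs, ys):
    """Formula (31): mean of the pair-products of associated standard scores."""
    zx, zy = standard_scores(xs), standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


def r_form_32(xs, ys):
    """Formula (32): sum of xy over the square root of (sum x^2)(sum y^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores, not taken from the text.
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 2, 5, 4]
print(round(r_form_31(xs, ys), 3), round(r_form_32(xs, ys), 3))   # 0.8 0.8
```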

If all the points in the scatter diagram lie on a straight
line, the equation of this line through the origin (Fig. 39) will
be $y = \pm mx$, where $\pm m$ is the slope of the line. The value for
the correlation coefficient in such a case may now be determined
by noting that $\sigma_y = m\sigma_x$, since

$$\sigma_y = \sqrt{\frac{\Sigma y^2}{N}} = \sqrt{\frac{m^2\,\Sigma x^2}{N}},$$

and that

$$\Sigma xy = \pm m\,\Sigma x^2 = \pm m N \sigma_x^2.$$

We may then write

$$r = \frac{\Sigma xy}{N \sigma_x \sigma_y} = \frac{\pm m N \sigma_x^2}{m N \sigma_x^2} = \pm 1.$$

Fig. 39. Illustrating extreme variations in the correlation coefficient
(panels: r = +1, r = 0, r = -1)

In case all the points lie on a horizontal line with a zero slope,
the value for r becomes indeterminate, that is, $\frac{0}{0}$. A symmetrical
arrangement of the points about such a line, however, will
give zero correlation, as shown in Fig. 39, because the quantity
$\Sigma xy$ is zero while N, $\sigma_x$, and $\sigma_y$ are not zero. With actual data,
therefore, the correlation coefficient may range in value from
-1 to 1.
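These limiting cases may be illustrated with a small Python sketch. The function name `correlation` and the sets of points are hypothetical; returning `None` for the indeterminate case is simply a convention adopted for this illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Formula (32); returns None when the ratio takes the form 0/0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    x = [v - mx for v in xs]
    y = [v - my for v in ys]
    denominator = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0:
        return None                  # no variation in one series: r is indeterminate
    return sum(a * b for a, b in zip(x, y)) / denominator


xs = [1, 2, 3, 4, 5]
rising = correlation(xs, [3 * v for v in xs])            # points on y = 3x
falling = correlation(xs, [-3 * v for v in xs])          # points on y = -3x
flat = correlation(xs, [7, 7, 7, 7, 7])                  # horizontal line
symmetric = correlation([-1, 1, -1, 1], [-1, -1, 1, 1])  # symmetrical arrangement

print(rising, falling, flat, symmetric)
```

The first two values come out as plus and minus one, the horizontal line gives no determinate value, and the symmetrical arrangement gives zero.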

*  A  full  discussion  of  the  history  of  correlation  is  given  by  Pearson  in  Biometrika, 
Vol.  XIII  (1920).  Here  Pearson  assigns  most  of  the  credit  to  Galton  and  minimizes 
the  significance  of  his  own  important  contribution. 



3. Computation of the Correlation Coefficient with Ungrouped Items

While formulas (31) and (32) are useful for calculating the
correlation coefficient, the arithmetic may become rather tedious
on account of the fact that the deviations x and y are usually
given in the form of decimals which would need to be multiplied
and squared. In order to overcome this difficulty an alternative
formula will be given.

Remembering that $x = X - M_x$, and that $y = Y - M_y$,
we may write

$$xy = XY - Y M_x - X M_y + M_x M_y,$$

and

$$\Sigma xy = \Sigma XY - M_x \Sigma Y - M_y \Sigma X + M_x M_y \Sigma(1) = \Sigma XY - N M_x M_y,$$

since $\Sigma X = N M_x$, $\Sigma Y = N M_y$, and $\Sigma(1) = N$.

From the chapter on dispersion it is also evident that

$$\sigma_x^2 = \frac{\Sigma X^2}{N} - M_x^2, \quad \text{and that} \quad \sigma_y^2 = \frac{\Sigma Y^2}{N} - M_y^2.$$

Substituting these values in equation (31) gives the desired
formula,

$$r = \frac{\Sigma XY - N M_x M_y}{\sqrt{(\Sigma X^2 - N M_x^2)(\Sigma Y^2 - N M_y^2)}} \qquad (33)$$

This equation, although more complicated in form than
formula (32), is generally preferable to the latter because it
involves the integral scores X and Y rather than the deviations
x and y. By designating the total as T, it is evident
that T = NM, and the above formula may also be written

$$r = \frac{\Sigma XY - T_x M_y}{\sqrt{(\Sigma X^2 - T_x M_x)(\Sigma Y^2 - T_y M_y)}} \qquad (34)$$

It should also be noted that when the variables are measured
from arbitrary origins, $A_x$ and $A_y$, the last two formulas may be
applied to $X' = X - A_x$, and to $Y' = Y - A_y$. This follows at
once from the fact that

$$M'_x = M_x - A_x, \quad \text{and} \quad M'_y = M_y - A_y,$$

so that $x' = x$ and $y' = y$.

The primed variables may then replace the unprimed variables
throughout in formulae (33) and (34), which means that the
correlation remains unchanged when any numbers $A_x$ and $A_y$ are
subtracted from X and Y, respectively.

The above formulas will now be applied to a short problem
with the data in the form of listed pairs of associated scores on
two tests, X and Y. The material is too scanty to have any
practical value, but has been chosen because the arithmetic is
short and the attention may be fixed on the form of computation.


Table 27. Illustrating the Computation of the Product-Moment
Correlation Coefficient by Deviations from the Means

    Pupil      X      Y       x        y       x^2       y^2       xy
    A         76     17    - 1.9    - 0.2      3.61      0.04    +  0.38
    B         74     15    - 3.9    - 2.2     15.21      4.84    +  8.58
    C         82     14    + 4.1    - 3.2     16.81     10.24    - 13.12
    D         63     12    -14.9    - 5.2    222.01     27.04    + 77.48
    E         74     18    - 3.9    + 0.8     15.21      0.64    -  3.12
    F         91     19    +13.1    + 1.8    171.61      3.24    + 23.58
    G         86     20    + 8.1    + 2.8     65.61      7.84    + 22.68
    H         82     23    + 4.1    + 5.8     16.81     33.64    + 23.78
    I         79     20    + 1.1    + 2.8      1.21      7.84    +  3.08
    J         72     14    - 5.9    - 3.2     34.81     10.24    + 18.88
    Totals   779    172                      562.90    105.60     162.20
    Means   77.9   17.2

In applying formula (32) it is first necessary to obtain the
means and deviations from the means for the two series, the
latter being given in the columns x and y above. The squared
and product terms are then formed and the sums $\Sigma x^2$, $\Sigma y^2$, and
$\Sigma xy$ obtained. The value for the coefficient then becomes

$$r = \frac{162.2}{\sqrt{562.9 \times 105.6}} = \frac{162.2}{\sqrt{59442.24}} = \frac{162.2}{243.8} = .665.$$
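The deviation method of formula (32) may be retraced in Python as a check on the arithmetic. This is a sketch only: the variable names are assumptions, and the ten pairs of scores are those of the short problem (they are also recoverable from Table 28 by adding 70 and 15 to the reduced scores).

```python
from math import sqrt

# Scores of pupils A to J on the two tests.
X = [76, 74, 82, 63, 74, 91, 86, 82, 79, 72]
Y = [17, 15, 14, 12, 18, 19, 20, 23, 20, 14]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n          # 77.9 and 17.2

# Deviations from the means.
x = [v - mean_x for v in X]
y = [v - mean_y for v in Y]

sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

r = sum_xy / sqrt(sum_x2 * sum_y2)               # formula (32)

print(round(sum_x2, 1), round(sum_y2, 1), round(sum_xy, 1))   # 562.9 105.6 162.2
print(round(r, 3))                                            # 0.665
```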
When using formula (33) or (34), it will be found convenient
to reduce the scores before calculating the necessary quantities.
Subtracting 70 from each of the X scores and 15 from each of
the Y scores gives the X' and the Y' series shown in the following
illustration (data from Table 27).

Table 28. Illustrating the Computation of the Product-Moment
Correlation Coefficient by Deviations from Assumed Means

    Pupil        X'      Y'     (X')^2    (Y')^2    (X'Y')
    A             6       2        36         4        12
    B             4       0        16         0         0
    C            12     - 1       144         1      - 12
    D           - 7     - 3        49         9        21
    E             4       3        16         9        12
    F            21       4       441        16        84
    G            16       5       256        25        80
    H            12       8       144        64        96
    I             9       5        81        25        45
    J             2     - 1         4         1      -  2
    Totals       79      22      1187       154       336
    M'          7.9     2.2
    T'M'      624.1    48.4     624.1      48.4     173.8
                              (= T'xM'x) (= T'yM'y) (= T'xM'y)
                                562.9     105.6     162.2

T'xM'y = T'yM'x = 173.8.


The squaring and multiplying may now all be done mentally
and the computation arranged as in the foregoing scheme. Using
formula (34) we then have

$$r = \frac{162.2}{\sqrt{562.9 \times 105.6}} = +\,.665,$$

as before.
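The raw-score method and the effect of reducing the scores may both be verified by a brief Python sketch. The function name `r_raw_scores` is an assumption for illustration; the computation follows formula (33) as given in the text, and the origins 70 and 15 are those used above.

```python
from math import sqrt


def r_raw_scores(X, Y):
    """Formula (33): the coefficient computed directly from the scores."""
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(Y) / n
    numerator = sum(a * b for a, b in zip(X, Y)) - n * mean_x * mean_y
    denominator = sqrt((sum(a * a for a in X) - n * mean_x ** 2)
                       * (sum(b * b for b in Y) - n * mean_y ** 2))
    return numerator / denominator


# Scores of pupils A to J, and the same scores reduced by 70 and 15.
X = [76, 74, 82, 63, 74, 91, 86, 82, 79, 72]
Y = [17, 15, 14, 12, 18, 19, 20, 23, 20, 14]
X_reduced = [v - 70 for v in X]
Y_reduced = [v - 15 for v in Y]

print(round(r_raw_scores(X_reduced, Y_reduced), 3))   # 0.665
print(round(r_raw_scores(X, Y), 3))                   # 0.665
```

Reducing the scores changes nothing in the result; it only keeps the squares and products small enough to be formed mentally.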

Of  the  two  types  of  calculation  shown  above,  the  second  is 
usually  the  easier,  although  both  become  rather  tedious  with 
a  long  series.  Formula  (33)  and  occasionally  formula  (32)  are 
recommended  for  short  series  of,  say,  20  to  30  pairs  of  scores, 
which  do  not  warrant  the  use  of  a  frequency  table. 
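
The arithmetic of Table 28 can be checked with a short sketch. It assumes that formula (34) has the form implied by the table, namely the corrected product sum divided by the square root of the product of the two corrected sums of squares; the function name `product_moment_r` is introduced here for illustration only.

```python
from math import sqrt

# X' and Y' series of Table 28 (scores less the assumed means 70 and 15).
x_dev = [6, 4, 12, -7, 4, 21, 16, 12, 9, 2]
y_dev = [2, 0, -1, -3, 3, 4, 5, 8, 5, -1]

def product_moment_r(xs, ys):
    """Correlation from sums of squares and products, each corrected by T'M'."""
    n = len(xs)
    total_x, total_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - total_x * total_x / n      # 1187 - 624.1
    syy = sum(y * y for y in ys) - total_y * total_y / n      # 154 - 48.4
    sxy = sum(x * y for x, y in zip(xs, ys)) - total_x * total_y / n  # 336 - 173.8
    return sxy / sqrt(sxx * syy)

r = product_moment_r(x_dev, y_dev)
print(round(r, 3))  # 0.665
```

Adding the assumed means back to the scores leaves r unchanged, which is why the scores may be reduced freely before the computation.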

As a warning to the student, it may be noted that correlations
based upon such a small number of cases are not of much
significance. In experimental work, however, problems of this
sort do arise, and it would then be convenient to employ the
above methods.

4. The Computation of the Correlation Coefficient
for a Frequency Table

A  two-way  frequency  table,  or  correlation  table,  may  be  made 
by  noting  the  frequencies  which  occur  in  the  cells  bounded  by 
certain  class  limits  on  the  two  scales.  Thus,  by  taking  class 
intervals  of  79.5-89.5,  89.5-99.5,  99.5-109.5,  etc.  for  the  Otis 
test,  and  29.75-34.75,  34.75-39.75,  etc.  for  the  Chicago  test,  a 
scheme  for  tabulation  may  be  set  up  as  shown  in  Table  29  (see 
data  from  Exercise  1,  Chapter  II).  Instead  of  recording  a  pair 
of  scores  as  a  point  on  a  scatter  diagram,  it  is  only  necessary 
to  make  a  tally  in  the  cell  within  which  this  pair  of  scores  must 
lie.  All  frequencies  in  a  given  cell  are  then  assumed  to  have 
the  class  values  of  that  cell.  For  example,  the  scores  of  the  first 
pupil  are  171  and  52.   This  pair  of  scores  is  recorded  by  a  tally 

Table 29. Correlation Table for the Otis and Chicago Test Scores *

[Body of table not recoverable from the scan. Horizontal scale: Otis
score, class limits marked 80, 90, ..., 190. Vertical scale: Chicago
score, class limits marked 30, 35, ..., 80. Total frequency: 50.]

*  The  exact  class  limits  have  not  been  set  down  in  the  table,  but  these  should  be 
kept  in  mind  in  tabulating  the  frequencies  and  in  subsequent  calculations  for  the 
means.

in the cell indicated by 169.5-179.5 on Otis and 49.75-54.75 on
Chicago,  with  the  resulting  class  values  174.5  and  52.25.  It  will 
be  noted  that  similar  errors  (differences  between  class  values 
and  observed  values)  appear  throughout  the  table,  but  that 
the  combined  effect  of  these  upon  the  correlation  coefficient 
will  not  be  large  if  the  class  intervals  are  fairly  numerous  on 
both  scales,  for  example,  from  10  to  20  intervals  as  in  the  case 
of  the  simple  frequency  distribution. 
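
The assignment of a score to its cell can be sketched as follows. The function `class_value` is a hypothetical helper, not part of the text; it assumes equal class intervals starting from the lowest class limit (79.5 for Otis, 29.75 for Chicago).

```python
from math import floor

def class_value(score, lowest_limit, width):
    """Midpoint of the class interval in which `score` falls."""
    k = floor((score - lowest_limit) / width)   # index of the interval
    lower = lowest_limit + k * width            # lower class limit
    return lower + width / 2

# First pupil: Otis 171, Chicago 52.
otis = class_value(171, 79.5, 10)
chicago = class_value(52, 29.75, 5)
print(otis, chicago)  # 174.5 52.25
```

The differences 174.5 - 171 and 52.25 - 52 are the grouping errors referred to above.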

Unless  a  mechanical  sorting  device  is  available  the  best  way 
to  make  a  correlation  table  is  with  the  aid  of  the  small  data 
tickets  described  in  Chapter  II.  These  may  be  sorted  into  a 
simple  frequency  distribution  according  to  one  of  the  variables, 
and  each  of  the  sub-piles  then  sorted  for  a  distribution  of  the 
associated  variable.  It  will  be  found  convenient  to  write  down 
the  class  limits  on  small  slips  of  paper,  laying  these  out  in  a  row 
on  a  long  table.  The  cards  are  then  sorted  into  piles  and  the 
work  verified  by  running  through  each  one.  These  piles  may 
then  be  secured  with  rubber  bands  and  a  new  series  of  class 
intervals  prepared  for  the  next  variable.  As  soon  as  each  pile 
has  been  sorted  in  this  way,  the  results  may  be  tabulated  in 
the  appropriate  column  on  a  sheet  of  square-ruled  paper  or  on 
a  special  form  to  be  described  below. 

In  calculating  the  correlation  coefficient  for  a  frequency  table 
it  will  be  necessary  to  modify  formula  (33)  so  as  to  bring  in 
the  frequency  notation.   Let 

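$f_x$ = the frequency of a column of type x,
$f_y$ = the frequency of a row of type y,
$f_{xy}$ = the frequency of a cell common to such a column and row,
$d_x$ and $d_y$ = the deviations in class intervals from the assumed
means on the two scales,
h and k = the widths of the class intervals for the variables
X and Y, respectively, and
X' and Y' = the variables measured from arbitrary origins, $A_x$
and $A_y$.

It is now evident that $d_x = \frac{X'}{h}$ and $d_y = \frac{Y'}{k}$,
so that

$$ \sum X'Y' = \sum d_x d_y\,hk = \Bigl(\sum f_{xy}\,d_x d_y\Bigr)hk, $$

$f_{xy}$ being merely a symbol of operation. In a similar way it
appears that

$$ \sum X'^2 = \Bigl(\sum f_x d_x^2\Bigr)h^2, \qquad \sum X' = \Bigl(\sum f_x d_x\Bigr)h, $$

and

$$ \sum Y'^2 = \Bigl(\sum f_y d_y^2\Bigr)k^2, \qquad \sum Y' = \Bigl(\sum f_y d_y\Bigr)k. $$

Substituting these values in formula (33) gives

$$ r = \frac{\sum f_{xy} d_x d_y - \dfrac{(\sum f_x d_x)(\sum f_y d_y)}{N}}{\sqrt{\Bigl[\sum f_x d_x^2 - \dfrac{(\sum f_x d_x)^2}{N}\Bigr]\Bigl[\sum f_y d_y^2 - \dfrac{(\sum f_y d_y)^2}{N}\Bigr]}} = \frac{a}{\sqrt{bc}}, \qquad (35) $$

{Correlation coefficient for distribution table}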

from  which  it  follows  that  the  correlation  coefficient  is  quite 
independent  of  the  magnitudes  of  the  class  intervals  and  of 
the  units  of  measurement.  The  three  principal  terms  in  this 
expression have been designated as a, b, and c for convenience.
The  complete  calculation  with  formula  (35)  is  illustrated  in 
Table  30  on  page  152  for  the  Otis-Chicago  data. 

$$ a = 170 - \frac{360}{50} = 170 - 7.2 = 162.8 $$

$$ b = 210 - \frac{576}{50} = 210 - 11.52 = 198.48 $$

$$ c = 225 - \frac{225}{50} = 225 - 4.5 = 220.5 $$

$$ r = \frac{a}{\sqrt{bc}} = \frac{162.8}{\sqrt{198.48 \times 220.5}} = \frac{162.8}{\sqrt{43764.84}} = \frac{162.8}{209.2} = .778. $$

By four-place logarithms,

log b = 2.2978
log c = 2.3434
log prod. = 4.6412
log √prod. = 2.3206

log a = 12.2116 - 10
log √prod. = 2.3206
log r = 9.8910 - 10

∴ r = +.778

The computation down to the quantities $\sum f_x d_x^2$ and
$\sum f_y d_y^2$ is the same as for the standard deviation, so that
the values for b and c may be readily obtained.
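
The worked figures may be verified with a brief sketch of formula (35). The sums 170, 210, and 225 are taken from the example; the values 24 and 15 for $\sum f_x d_x$ and $\sum f_y d_y$ are an assumption, inferred from the corrections 11.52 and 4.5 (their signs cannot be recovered, but the positive correction 7.2 shows they agree in sign). The function name is hypothetical.

```python
from math import sqrt

def r_from_table_sums(n, s_xy, s_xx, s_yy, s_x, s_y):
    """Return a, b, c and r = a / sqrt(b*c) from the sums of a correlation table.

    s_xy = sum of f_xy*d_x*d_y, s_xx = sum of f_x*d_x^2, s_yy = sum of f_y*d_y^2,
    s_x  = sum of f_x*d_x,      s_y  = sum of f_y*d_y.
    """
    a = s_xy - s_x * s_y / n
    b = s_xx - s_x ** 2 / n
    c = s_yy - s_y ** 2 / n
    return a, b, c, a / sqrt(b * c)

# s_x = 24 and s_y = 15 are inferred values, not read from Table 30.
a, b, c, r = r_from_table_sums(50, 170, 210, 225, 24, 15)
print(round(a, 2), round(b, 2), round(c, 2), round(r, 3))  # 162.8 198.48 220.5 0.778
```

Since h and k cancel in the ratio, the class widths do not enter the function at all.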

The calculation for a presents a little more difficulty. The
quantity $\sum f_{xy} d_x d_y$ is the result of multiplying each cell
frequency by its $d_x$ and $d_y$ and then adding all the products so
formed. A more convenient method of calculation, however, is to
multiply the cell frequencies in a particular column by the
appropriate $d_y$ values, add the results thus found, and multiply
this sum by the $d_x$ value of the column. Continuing in this way
for all the columns, and adding the products thus found gives
the required $\sum f_{xy} d_x d_y$. Thus, the frequencies in the column
at 150-160 on Otis have been multiplied by the corresponding
$d_y$ values and the results recorded in the lower left corners of
the cells as 5, 6, 4, 4, 0 (coming down from the top). The sum of
these quantities is 19, which, when multiplied by the $d_x$ value
2, gives 38 as the contribution of this column to the total
$\sum f_{xy} d_x d_y$. The same result would have been obtained if the
cell frequencies had been multiplied by $d_x$ and $d_y$ at the same
time and added, thus:

1×2×5 + 2×2×3 + 2×2×2 + 4×2×1 + 2×2×0
= 10 + 12 + 8 + 8 + 0 = 38

The work is shortened, however, by factoring out $d_x$, which is
common to all the products.
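
The two ways of forming the contribution of one column may be compared in a few lines. The frequencies and $d_y$ values below are those of the 150-160 column as read from the products shown above; the variable names are illustrative.

```python
# Column at 150-160 on Otis: cell frequencies (from the top) and their d_y values.
freqs = [1, 2, 2, 4, 2]
d_y = [5, 3, 2, 1, 0]
d_x = 2  # deviation of this column from the assumed mean, in class intervals

# Method 1: multiply every cell by d_x and d_y at the same time.
cell_by_cell = sum(f * d_x * dy for f, dy in zip(freqs, d_y))

# Method 2: sum f*d_y down the column first, then multiply once by d_x.
column_sum = sum(f * dy for f, dy in zip(freqs, d_y))
factored = d_x * column_sum

print(column_sum, factored, cell_by_cell)  # 19 38 38
```

The second method requires one multiplication by $d_x$ per column instead of one per cell.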

The symbol $\sum$ has been used to indicate summation over the
whole table, that is, over N items. In order to distinguish
summation over the arrays (columns or rows), this has been
designated in the table by $\sum'$. Thus, $\sum' f_{xy} d_y$ means the
sum for one column of $f_{xy}$ multiplied by the corresponding
values $d_y$.

A very useful check on the computation of a is shown by the
double arrow in Table 30. The sum of the quantities $\sum' f_{xy} d_y$
should be the same as $\sum f_y d_y$, or
$\sum(\sum' f_{xy} d_y) = \sum f_y d_y$. Until he becomes proficient in
working with the numbers in the corners of the cells, the student
should always use this check.

The correction $\frac{(\sum f_x d_x)(\sum f_y d_y)}{N}$ applied to
$\sum f_{xy} d_x d_y$ will sometimes be positive and sometimes
negative, and it should be remembered that it is to be subtracted
algebraically. The remainder of the calculation for r is shown in
the example both by straight arithmetic and by logarithms, which
are recommended. Computations of this length need to be carefully
planned and arranged in order that they may be done quickly and
accurately. A standard form has therefore been prepared by the
writer. The data may be recorded directly on this sheet (Table 31),
and the calculations performed very rapidly.

5. Lines of Regression

In the problem just worked out, it was assumed that the
trend of the data in the correlation table was such that the
linear method might be applied. As already pointed out,
the maximum correlation will occur when all the points fall
on a straight line; but with any scatter in the data, two lines will
be obtained for a correlation table, one from the means of the
columns and one from the means of the rows. The graphical
test for approximate linearity and the justification of the use of
the product-moment method are furnished by plotting the
means of these arrays and noting the extent to which they fall
on these two straight lines. A more rigorous test will be given in
Chapter X, where it is shown that lack of such linearity reduces
the amount of the product-moment correlation.

The curves fitting the means of the columns and rows are
known as regression curves. They will be illustrated with a larger
body of data than that used above, on account of the small number
of frequencies in the Otis-Chicago table. The material in
Table 32 was supplied through the courtesy of Mr. Douglas E.
Scates.* The values on the horizontal scale are the percentage

* "A Study of High-School and First-Year University Grades," The School
Review, Vol. XXXII (March, 1924).

In Fig. 40 the means of the columns have been plotted as
dots and the means of the rows as crosses. The former array
of points fits rather closely a straight line drawn through them,
but the means of the rows form an irregular curve. While freehand
curves drawn through both sets of points would give rough
approximations to the regression curves, it is better to employ
a mathematical method for fitting such curves. In the following
paragraph we shall, therefore, discuss a method for obtaining
the best-fitting regression curve in the form of a straight line.

Fig. 40. Regression lines for 1707 grades
The dots represent the means of the columns and the crosses the means of the rows

If we select the line for the means of the columns it is evident
that, when taken through the mean of the table at M (Fig. 41),
its equation will be of the form

$$ \bar{y} = mx, $$

where m is the slope to be determined, and $\bar{y}$ denotes a point
on the line. If P represents any point in the table, its vertical
deviation from the line will be $y - \bar{y}$, as shown in the
accompanying figure. The problem now is to select m so that the
sum of the squares of these deviations (residuals) for the N points
shall be as small as possible, that is, so that $\sum (y - \bar{y})^2$
shall be a minimum. Replacing $\bar{y}$ by mx, and expanding, we may
write the necessary condition in the form

$$ \sum y^2 - 2m\sum xy + m^2\sum x^2 = \text{a minimum}, $$

or, dividing by N and writing $\sum xy = N r \sigma_x \sigma_y$,

$$ \sigma_y^2 - 2mr\sigma_x\sigma_y + m^2\sigma_x^2 = \text{a minimum}. $$

Differentiating * this last expression with respect to m, and
setting the result equal to zero, gives

$$ m = r\frac{\sigma_y}{\sigma_x}, $$

which is known as the regression coefficient of y on x. The
required equation through the origin thus becomes

$$ \bar{y} = \Bigl(r\frac{\sigma_y}{\sigma_x}\Bigr)x. \qquad (36) $$

{Regression line for means of columns, referred to mean of table}

Fig. 41. Illustrating the derivation of the regression line

* If the reader is unfamiliar with the calculus he should skip to the following
paragraph.

In case the student is not familiar with the differential
calculus, the above result may also be shown in the following way:

Setting

$$ S_y^2 = \sigma_y^2 - 2mr\sigma_x\sigma_y + m^2\sigma_x^2, $$

we shall assume m to have the value $r\frac{\sigma_y}{\sigma_x}$ and show
that any different value will produce a larger squared sum. It
follows that

$$ S_y^2 = \sigma_y^2 - 2r^2\sigma_y^2 + r^2\sigma_y^2, $$

or

$$ S_y = \sigma_y\sqrt{1 - r^2}. \qquad (37) $$

{Standard error of estimate}

Taking $m = r\frac{\sigma_y}{\sigma_x} + \delta$, we find that

$$ S_y'^2 = \sigma_y^2 - 2r^2\sigma_y^2 - 2r\sigma_x\sigma_y\delta + r^2\sigma_y^2 + 2r\sigma_x\sigma_y\delta + \sigma_x^2\delta^2, $$

or

$$ S_y'^2 = \sigma_y^2(1 - r^2) + \sigma_x^2\delta^2. $$

No matter whether $\delta$ be positive or negative, $S_y'^2$ is greater
than $S_y^2$, and the value of m giving the minimum is therefore
$r\frac{\sigma_y}{\sigma_x}$.

By similar reasoning it may be shown that

$$ \bar{x} = \Bigl(r\frac{\sigma_x}{\sigma_y}\Bigr)y, \qquad (38) $$

{Regression line for means of rows, referred to mean of table}

where $r\frac{\sigma_x}{\sigma_y}$ is the regression coefficient of x on y.
The two regression lines given by (36) and (38) furnish not only the
best fit to all the points in the table but also to the means of the
columns and rows when the deviations are weighted by the
frequencies of the arrays.* The regression coefficients
$r\frac{\sigma_y}{\sigma_x}$ and $r\frac{\sigma_x}{\sigma_y}$ give the average
change in y and x for a unit change in x and y, respectively.
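
The minimum property just derived can be confirmed numerically. The sketch below borrows the X' and Y' series of Table 28 purely as sample data (an assumption of convenience; any paired series would serve) and checks that moving the slope away from $r\sigma_y/\sigma_x$ by $\delta$ raises the mean squared residual by $\sigma_x^2\delta^2$.

```python
from math import sqrt

# X' and Y' series of Table 28, used here only as sample data.
xs = [6, 4, 12, -7, 4, 21, 16, 12, 9, 2]
ys = [2, 0, -1, -3, 3, 4, 5, 8, 5, -1]
n = len(xs)

# Deviations x and y from the means of the table.
mean_x, mean_y = sum(xs) / n, sum(ys) / n
x = [v - mean_x for v in xs]
y = [v - mean_y for v in ys]

sigma_x = sqrt(sum(v * v for v in x) / n)
sigma_y = sqrt(sum(v * v for v in y) / n)
r = sum(p * q for p, q in zip(x, y)) / (n * sigma_x * sigma_y)

def mean_squared_residual(m):
    """Mean of (y - m*x)^2 over the N points for a line of slope m."""
    return sum((q - m * p) ** 2 for p, q in zip(x, y)) / n

best_slope = r * sigma_y / sigma_x
minimum = mean_squared_residual(best_slope)

for delta in (-0.1, 0.1):
    excess = mean_squared_residual(best_slope + delta) - minimum
    # The excess should equal sigma_x^2 * delta^2 whatever the sign of delta.
    print(delta, round(excess, 6), round(sigma_x ** 2 * delta ** 2, 6))
```

The value `minimum` also agrees with $\sigma_y^2(1 - r^2)$, the square of the standard error of estimate in (37).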

In case the variables are taken from the origins of the
measurements, equations (36) and (38) may be transformed by the
relations $x = X - M_x$ and $y = Y - M_y$, giving

$$ Y = r\frac{\sigma_y}{\sigma_x}X - r\frac{\sigma_y}{\sigma_x}M_x + M_y \qquad (39) $$

and

$$ X = r\frac{\sigma_x}{\sigma_y}Y - r\frac{\sigma_x}{\sigma_y}M_y + M_x. \qquad (40) $$

{Regression lines in score form}

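Using the notation a, b, and c as in the calculation of the
correlation coefficient, and denoting the class intervals on X and Y
by h and k, respectively, two other equations may be obtained.

Since $r = \frac{a}{\sqrt{bc}}$, $\sigma_x = \Bigl(\sqrt{\frac{b}{N}}\Bigr)h$, and
$\sigma_y = \Bigl(\sqrt{\frac{c}{N}}\Bigr)k$, it follows that

$$ Y = \frac{a}{b}\cdot\frac{k}{h}X - \frac{a}{b}\cdot\frac{k}{h}M_x + M_y \qquad (41) $$

and

$$ X = \frac{a}{c}\cdot\frac{h}{k}Y - \frac{a}{c}\cdot\frac{h}{k}M_y + M_x. \qquad (42) $$

{Regression lines in score form and symbols on correlation sheet}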

These  last  equations  are  the  easiest  to  calculate,  since  all  the 
necessary  quantities  may  be  obtained  directly  from  those  given 
in  the  work  for  the  correlation  coefficient. 

For the university and school data in Table 32 we find that

a = 17,468,        $M_x$ = 86.71,
b = 28,838,        $M_y$ = 2.51,
c = 28,249,        h = 1, and k = 1/3.

* Yule, Introduction to Statistics, p. 172.

Substituting these values in equations (41) and (42), we find
that

$$ Y = .2019X - 15.00 $$

and

$$ X = 1.855Y + 82.05. $$

These  regression  lines  are  plotted  in  Fig.  40  on  page  158. 
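
The substitution may be checked by a short sketch. It assumes that equations (41) and (42) have the slopes (a/b)(k/h) and (a/c)(h/k), which is what follows from the expressions for r, $\sigma_x$, and $\sigma_y$ given above; the variable names are introduced here for illustration.

```python
from math import sqrt

# Quantities quoted in the text for the data of Table 32.
a, b, c = 17468, 28838, 28249
m_x, m_y = 86.71, 2.51
h, k = 1, 1 / 3

slope_y_on_x = (a / b) * (k / h)             # coefficient of X in (41)
slope_x_on_y = (a / c) * (h / k)             # coefficient of Y in (42)
intercept_y = m_y - slope_y_on_x * m_x       # constant term of (41)
intercept_x = m_x - slope_x_on_y * m_y       # constant term of (42)
r = a / sqrt(b * c)

print(round(slope_y_on_x, 4), round(intercept_y, 2))  # 0.2019 -15.0
print(round(slope_x_on_y, 3), round(intercept_x, 2))  # 1.855 82.05
print(round(r, 3))                                    # 0.612
```

The square root of the product of the two slopes reproduces the same value of r, since h and k cancel.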
Representing the regression coefficients by

$$ b_{yx} = r\frac{\sigma_y}{\sigma_x} = \frac{ak}{bh} \qquad (43) $$

and

$$ b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{ah}{ck}, \qquad (44) $$

{Regression coefficients}

it follows at once that $r = \sqrt{b_{xy} \cdot b_{yx}}$. For the above
data, therefore, $r = \sqrt{.3745245} = .6120$. The correlation
coefficient is thus the geometric mean of the two regression
coefficients. Equations (43) and (44) also show that while $b_{yx}$
and $b_{xy}$ are functions of scale units, their product is a pure
number.

Returning to the equation for the means of the columns, it is
evident that it may prove useful in predicting the most probable
(average) university grades Y for given high-school grades X.
Thus a student entering The University of Chicago with a
high-school average of 95 will probably make a university average
of .2019 × 95 - 15.00 = 4.18 grade points, or a little better than
B; while a student entering with a high-school average of 85
will most likely have a university average of 2.16, or slightly
better than the required C.

A  measure  of  the  value  of  these  predictions  is  given  by  the 
standard  deviation  of  all  the  observed  variations  from  the  re- 
gression line.  This  quantity,  which  is  known  as  the  standard 
error  of  estimate,  has  already  been  presented  in  equation  (37) 
for  the  line  through  the  means  of  the  columns.  Working  out  a 
similar  formula  for  the  rows  and  multiplying  the  results  by  .6745 
in  order  to  obtain  the  probable  error  *  of  estimate,  we  have 

* For a complete discussion of probable error, see Chapter XIII. The probable
error is so defined that half of the errors lie within the limits, mean - probable
error and mean + probable error, or M ± P.E.

$$ P.E.(\text{est. } Y) = .6745\,\sigma_y\sqrt{1 - r^2} \qquad (45) $$

{Probable error of estimate in predicting Y from X}

and

$$ P.E.(\text{est. } X) = .6745\,\sigma_x\sqrt{1 - r^2}. \qquad (46) $$

{Probable error of estimate in predicting X from Y}

Since $\sigma_y = \Bigl(\sqrt{\frac{c}{N}}\Bigr)k = 1.356$, the probable error
of estimate for university grades is
$.6745(1.356)\sqrt{1 - (.612)^2} = .723$. Such calculations are
facilitated by the use of logarithms of $\sqrt{1 - r^2}$ given on
page 54 of Holzinger's Tables.

The complete equation for prediction may now be written.

$$ Y = .2019X - 15.00 \pm .72. $$

The use of this equation may be illustrated in the case of
a student with a high-school average of 95. Substituting
X = 95, we find that Y, or the most probable university average,
is 4.18 grade points. This result is written 4.18 ± .72, with the
interpretation that it is an even chance that the student's
university average will be anywhere from 3.46 to 4.90 grade points,
or between B- and A-. This conclusion may be drawn from the fact
that half of the observed deviations from the predicted mean (4.18)
lie between these two values, or, as shown in Fig. 42, half the area
of the curve lies between these limits. (See Chapter XII for
further interpretation of probable error.)

Fig. 42. Illustrating the probable error of estimate
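
The prediction and its probable error may be reproduced as follows. This is only a sketch of formula (45) with the rounded values quoted in the text; the function name is hypothetical.

```python
from math import sqrt

def probable_error_of_estimate(sigma, r):
    """.6745 times the standard error of estimate, sigma * sqrt(1 - r^2)."""
    return 0.6745 * sigma * sqrt(1 - r * r)

pe = probable_error_of_estimate(1.356, 0.612)   # sigma_y and r from the text
predicted = 0.2019 * 95 - 15.00                 # high-school average of 95
low, high = predicted - pe, predicted + pe

print(round(pe, 3), round(predicted, 2), round(low, 2), round(high, 2))
# 0.723 4.18 3.46 4.9
```

The limits `low` and `high` are the bounds within which the student's university average has an even chance of falling.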

As  might  be  expected,  the  value  of  the  prediction  becomes 
better  as  the  correlation  increases.  When  r  =  1,  the  standard 
error  of  estimate  is  zero  and  the  prediction  is  perfect  in  the 
sense  that  all  the  observed  values  lie  on  a  straight  line.  For  a 
list  of  cautions  to  be  observed  in  using  regression  equations, 
see Chapter XV, section 4.

The meaning of the term "regression," which is due to Sir
Francis Galton, may be shown in the case of the line for
predicting the height of sons from the height of their fathers.
The equation of this line is approximately

$$ S = .5F + 34'' \qquad (M_S = M_F = 68'') $$

where  S  represents  the  son's  height  and  F  the  father's  height  in 
inches.  By  substituting  a  few  values  of  F  near  the  mean  we  find 
the  results  which  are  listed  in  the  following  table. 


         S      F     S - F
        64     60      +4
        65     62      +3
        66     64      +2
        67     66      +1
Mean    68     68       0
        69     70      -1
        70     72      -2
        71     74      -3
        72     76      -4

The  column  S  —  F  shows  regression,  or  the  tendency  of  the 
son's  predicted  height  to  be  nearer  the  mean  than  the  height  of 
his  father.  Thus  a  father  74  inches  tall  will  be  expected  to  have 
a  son  only  71  inches  in  height,  while  a  father  62  inches  tall  will 
most  probably  have  a  son  65  inches  in  height,  the  son's  height 
each  time  regressing  toward  the  mean  of  the  race.  This  is  one 
of  the  important  laws  of  inheritance. 
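
The table of predicted heights can be regenerated from the approximate equation S = .5F + 34. The function name below is illustrative only.

```python
MEAN_HEIGHT = 68  # inches; mean for both fathers and sons in the text

def predicted_son_height(father):
    """Predicted height of son, in inches, from S = .5F + 34."""
    return 0.5 * father + 34

for father in range(60, 78, 2):
    son = predicted_son_height(father)
    # The last column is S - F; its sign is opposite to that of F - 68.
    print(son, father, son - father)
```

Each predicted value lies exactly halfway between the father's height and the mean of 68 inches, which is the regression the column S - F displays.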

Galton's term "regression" continues to be used for any sort
of  curve  which  fits  the  means  of  the  arrays  in  a  correlation  table, 
even  though  no  problem  of  inheritance  is  to  be  considered. 


6.  The  Interpretation  of  the  Correlation  Coefficient 

In  the  case  of  perfect  correlation  between  two  variables  the 
association  is  regarded  by  some  writers  as  a  causal  one,  and  a 
smaller  degree  of  correlation  as  an  approximation  to  causal 
relationship. It is usually best, however, to avoid this
interpretation and to regard the correlation coefficient merely as a
mathematical  expression  for  the  degree  of  association  between 
the  traits,  regardless  of  the  factors  producing  the  result.  This 
may  be  illustrated  by  the  correlation  between  height  and  score 
on  an  intelligence  test  with  a  group  of  pupils  from  Grades  III 
to  VII.  The  correlation  found  was  .71,  but  it  would  be  absurd 
to  say  that  the  physical  growth  caused  the  mental  growth,  or 
vice  versa.  The  relationship  observed  was  largely  due  to  a  third 
factor,  age,  which  when  eliminated  reduced  the  amount  of  the 
correlation  to  only  —  .06. 

All  statistical  data  are  affected  by  a  multiplicity  of  factors 
which  may  obscure  the  meaning  of  the  relationship  found  be- 
tween two  observed  variables.  For  example,  the  correlation 
between  high-school  and  university  grades  found  on  page  161 
was  .612,  a  result  doubtless  due  in  a  large  measure  to  the  men- 
tality of  the  student.  Many  other  factors,  however,  such  as  his 
age,  sex,  nationality,  health,  ambition,  methods  of  study,  regu- 
larity of  attendance,  and  personal  appearance,  as  well  as  the 
type  of  examinations  and  reaction  of  the  instructors,  doubt- 
less contribute  also  to  the  observed  correlation.  Scholarship  as 
measured  by  marks  is  thus  a  variable  made  up  of  a  large  num- 
ber of  other  variables,  and  the  correlation  found  is  of  doubtful 
meaning  so  far  as  causes  are  concerned ;  its  main  value  here  is 
for  predicting  university  grades  from  high-school  grades  regard- 
less of  the  factors  affecting  such  estimates. 

With  standardized  tests  it  is  possible  to  obtain  results  which 
give  a  better  approximation  to  the  correlation  between  simple 
variables.  The  tests  themselves  measure  fairly  well  certain 
aspects  of  human  abilities  such  as  rate  in  reading,  accuracy  in 
arithmetic,  and  quality  in  handwriting.  Proper  methods  of 
administering  and  scoring  the  tests  will  eliminate  to  a  large 
extent errors of the observer, while such outstanding factors as
age, sex, grade, and nationality may be controlled by selection
of the cases. The correlation between rate and comprehension
in reading on a certain test for fifty pupils aged twelve and in
the seventh grade has therefore a good deal more meaning
than the correlation for scholarship quoted above.

One  rather  common  interpretation  of  correlation  is  that  it 
shows  the  percentage  of  agreement  between  the  associated  traits. 
Thus  a  coefficient  of  .90  would  show  90  per  cent  agreement, 
while  a  coefficient  of  .45  would  show  45  per  cent  agreement. 
This  interpretation  is  entirely  misleading  since  the  intensity  of 
association  does  not  vary  directly  as  the  size  of  the  correlation 
coefficient. 

Another  custom  in  dealing  with  correlation  is  to  classify  the 
coefficients as "high," "medium," or "low." Thus .75 would
generally  be  regarded  as  "high,"  while  .25  would  be  considered 
as  "low."  This  terminology  may  be  convenient  in  dealing  with 
test  material  where  the  percentage  of  coefficients  above  .75  and 
below  .25  is  small,  but  may  be  misleading  when  dealing  with 
other  types  of  data.  In  an  age-grade  table,  for  example,  a 
correlation  of  .75  would  be  found  by  comparison  with  similar 
coefficients  to  be  relatively  low.  Another  misconception  some- 
times occurs  in  interpreting  a  "high"  coefficient,  such  as  .7,  as 
meaning  almost  perfect  agreement.  How  far  this  is  from  the 
truth  may  be  seen  by  mere  inspection  of  the  scatter  diagram 
for  values  of  this  size. 

An  interpretation  of  the  correlation  coefficient  that  is  of  some 
theoretical  interest  may  be  illustrated  by  a  problem  in  dice 
throwing  known  as  Weldon's  experiment.*  Twelve  dice  were 
shaken  in  a  box  and  thrown  again  and  again,  the  number  of  dice 
showing  four  or  more  spots  on  the  upper  face  being  recorded. 
When  the  results  of  the  first,  third,  fifth,  etc.  throws  were 
paired  against  the  results  of  the  second,  fourth,  sixth,  etc. 
throws,  no  correlation  was  found  because  all  the  events  were 
quite  independent  of  one  another. 

Next, half of the dice were stained red and after throwing
them all and counting all those showing four or more spots,

* William Brown, Essentials of Mental Measurement, p. 78. Cambridge
University Press, England, 1911.

the second and every alternate throw thereafter was made by
leaving  the  red  dice  upon  the  table,  but  counting  both  colors 
when  computing  the  score.  The  number  due  to  the  red  dice 
was  thus  common  to  the  two  scores.  By  continuing  in  this  way, 
two  series  of  odd  and  even  throws  were  formed  for  which  the 
correlation  approached  the  value  6/12,  or  .5.  In  more  general 
terms,  if  n  is  the  total  number  of  dice  thrown,  and  c  the  number 
common  to  the  pairs  of  throws,  the  expected  correlation  will  be 

$$ r = \frac{c}{n}. $$ *

The  correlation  coefficient  may  thus  be  regarded  as  the  ratio 
of  the  number  of  equally  effective  elements  which  two  variables 
have  in  common  to  the  total  number  of  independent  elements 
constituting  each,  or,  more  briefly,  as  the  proportion  of  common 
elements  or  causes.  It  is  hardly  necessary  to  add  that  this  in- 
terpretation is  little  more  than  suggestive  in  dealing  with  ordi- 
nary statistical  data  where  systems  of  causation  are  extremely 
complicated. 
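
The dice experiment lends itself to a simple simulation. The sketch below is an assumption-laden imitation of the procedure described, not a reproduction of Weldon's data: each pair of scores shares the dice left on the table, and the remaining dice are thrown afresh for the second score. The function names and the number of pairs are chosen here for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

def count_successes(n_dice, rng):
    """Number of dice, out of n_dice thrown, showing four or more spots."""
    return sum(rng.randint(1, 6) >= 4 for _ in range(n_dice))

def weldon_pairs(n_dice, n_common, n_pairs, rng):
    """Paired scores in which n_common dice are left on the table."""
    firsts, seconds = [], []
    for _ in range(n_pairs):
        common = count_successes(n_common, rng)   # red dice, counted in both scores
        firsts.append(common + count_successes(n_dice - n_common, rng))
        seconds.append(common + count_successes(n_dice - n_common, rng))
    return firsts, seconds

rng = random.Random(0)  # fixed seed so that the run is repeatable
firsts, seconds = weldon_pairs(12, 6, 20000, rng)
r = pearson_r(firsts, seconds)
print(round(r, 2))  # should lie close to 6/12 = .5
```

With no dice left on the table (`n_common = 0`) the same routine gives a correlation near zero, as in the first form of the experiment.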

A final interpretation of correlation arises from a consideration
of the standard error of estimate, $\sigma_y\sqrt{1 - r^2}$. This
quantity, as already noted, gives the error in prediction by use of
a single score with the regression equation
$\bar{y} = r\frac{\sigma_y}{\sigma_x}x$. In case r = 0, the prediction has a
standard error which is equal to the standard deviation of the
predicted variable and is therefore no better than that which would
be obtained by selecting a value of y at random from the observed
distribution. As r becomes larger, however, the predictive value of
a single score becomes better than that afforded by such a chance
estimate, the improvement being measured in percentage terms by

$$ I_p = 100\bigl(1 - \sqrt{1 - r^2}\bigr). \qquad (47) $$

{Improvement in prediction by a single score}

* For proof see William Brown, Essentials of Mental Measurement, p. 79.

As an example, it may be noted that a correlation coefficient
of .80 gives a value of $I_p$ = 40, which means that the regression
forecast with a single score is here only 40 per cent better than
a random guess.

Interpretations  of  this  sort  may  be  useful  in  dealing  with 
correlations  as  a  basis  for  linear  prediction,  but  they  are  likely 
to  give  a  false  impression  of  the  intensity  of  association  given 
by  the  correlation  coefficient  itself.  Thus  a  coefficient  of  .80 
may  be  rather  unsatisfactory  for  estimates  by  a  single  score, 
but  the  value  Ip  =  40  can  hardly  lead  us  to  regard  .80  as  a 
"poor" correlation with test scores, as one writer suggests.

It  should  also  be  noted  that  the  interpretation  afforded  by 
formula  (47)  is,  like  the  preceding  attempts,  quite  arbitrary. 
A  very  different  but  equally  valid  percentage  measure  could 
be  easily  derived  by  considering  the  squared  error  of  estimate 
instead  of  the  first  power. 
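
The arbitrariness may be seen by placing the two measures side by side. The first function follows formula (47); the second is only one possible reading of "the squared error of estimate," assumed here to mean the percentage reduction in $\sigma_y^2(1 - r^2)$ relative to $\sigma_y^2$, which reduces to $100r^2$. Both function names are hypothetical.

```python
from math import sqrt

def improvement_first_power(r):
    """Formula (47): percentage reduction in the standard error of estimate."""
    return 100 * (1 - sqrt(1 - r * r))

def improvement_squared_error(r):
    """Assumed alternative: percentage reduction in the squared error, 100 * r^2."""
    return 100 * (1 - (1 - r * r))

for r in (0.25, 0.50, 0.80):
    print(r, round(improvement_first_power(r), 1), round(improvement_squared_error(r), 1))
```

For r = .80 the first measure gives 40 per cent and the second 64 per cent, so the same coefficient may be made to appear poor or fair according to the measure chosen.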

The writer is of the opinion that "layman's interpretations"
of  correlation  coefficients  should  ordinarily  be  avoided.  Such 
attempts,  as  already  pointed  out,  are  usually  based  on  arbitrary 
assumptions  and  may  furnish  quite  misleading  and  inconsistent 
results  in  the  hands  of  the  "layman."  The  student  of  statistics 
will  soon  find  that  he  needs  no  such  devices  and  his  best  and 
most  useful  guide  in  interpreting  correlations  will  be  given  by 
a  simple  scatter  diagram  with  fitted  regression  curves.  Com- 
parisons between  correlation  coefficients  should  always  be  made 
through  the  use  of  the  sampling  errors  discussed  in  Chapter  XIII. 

7.  Some  Uses  of  Correlation  in  Evaluating 
Test  Material 

In  the  preparation  of  standardized  test  material  several  terms 
have come to be accepted with very definite meanings. One of
these is the reliability of the test, which may be defined as the
consistency with which the test measures what it purports to
measure. An index of this consistency is given by the reliability
coefficient, which is merely the correlation between two forms of
the same test given at different times. If $X_1$ and $X_I$ be the two
test forms, this coefficient may be expressed as $r_{1I}$.

Another  characteristic  of  a  test  is  its  validity,*  by  which  is 
meant  the  extent  to  which  it  does  measure  what  it  purports  to 
measure.  The  evidence  in  this  case  must  of  course  be  indirect. 
It  is  customary  to  select  some  criterion  C,  which  is  known  to 
index the trait in question. The correlation $r_{cx}$ between the
criterion  and  the  test  therefore  furnishes  numerical  evidence  of 
the  validity  of  the  latter. 

Suppose,  for  example,  that  five  tests  are  proposed  as  measures 
of  intelligence.  By  correlating  the  results  of  each  of  these  with 
the  scores  on  some  accepted  scale  such  as  the  Stanford-Binet, 
a series of validity coefficients of the form $r_{c1} = .78$, $r_{c2} = .82$,
$r_{c3} = .40$, $r_{c4} = .78$, and $r_{c5} = .43$ might be obtained. Tests $X_1$,
$X_2$, and $X_4$ would thus be regarded as considerably more valid
than  the  other  two.  In  case  it  is  objected  that  high  correlation 
is  no  sure  evidence  that  the  tests  are  measuring  the  same  thing 
as  the  criterion,  it  may  be  argued  that  this  is  the  best  evidence 
we  have  and  that  such  correlation  shows  that  the  tests  have  high 
predictive  value,  which  is  sufficient  justification  for  their  use. 

In case a number of similar tests are pooled the reliability
and validity coefficients of the lengthened test may be obtained
by applying Professor Spearman's† theorem on the correlation
of sums and differences. These new formulas will be derived
directly, however, making use of standard scores such as
$z = \frac{x}{\sigma}$ (see Chapter VII).

Let $z_1 = \frac{x_1}{\sigma_1}$ and $z'_1 = \frac{x'_1}{\sigma'_1}$ be the standard scores on two similar
tests, and let $z_I = \frac{x_I}{\sigma_I}$ and $z'_I = \frac{x'_I}{\sigma'_I}$ be the standard scores on
comparable forms of each. The problem is to determine the
reliability of $z_1 + z'_1$, knowing the reliability of $z_1$. This will be
given by working out

$$r_{(z_1+z'_1)(z_I+z'_I)} = \frac{\sum z_1 z_I + \sum z_1 z'_I + \sum z'_1 z_I + \sum z'_1 z'_I}{N\,\sigma_{(z_1+z'_1)}\,\sigma_{(z_I+z'_I)}}$$

which comes from expanding $r = \frac{\sum xy}{N\sigma_x\sigma_y}$ when $x = z_1 + z'_1$ and
$y = z_I + z'_I$. The standard deviations $\sigma_{(z_1+z'_1)}$ and $\sigma_{(z_I+z'_I)}$
reduce to $\sqrt{2 + 2r_{1I}}$, since $\sigma_{z_1} = \sigma_{z'_1} = \sigma_{z_I} = \sigma_{z'_I} = 1$. All the
correlations between the z's are equal to $r_{1I}$, so that each sum in
the numerator equals $N r_{1I}$. We therefore have

$$r_{(z_1+z'_1)(z_I+z'_I)} = \frac{2r_{1I}}{1 + r_{1I}}.$$

* Instead of using the term "validity" some test workers prefer to speak of the
"predictive value" of a test. This is essentially the same property as validity,
inasmuch as both are measured by correlating the test with some criterion.

† C. Spearman, "Correlations of Sums and Differences," British Journal of
Psychology, Vol. V (1913), p. 417.

By  induction  it  may  be  easily  shown  that  by  increasing  a  test 
n-fold  with  similar  material  the  cumulative  reliability  coefficient 
is  given  by 

$$r_{nn} = \frac{n\,r_{1I}}{1 + (n-1)\,r_{1I}} \tag{48}$$
{Spearman-Brown formula for predicting reliability of lengthened tests}

This  expression  is  often  called  the  Spearman-Brown  prophecy 
formula,  since  it  was  proved  independently  by  both  men. 

As  an  example  let  us  assume  that  a  test  with  a  reliability 
coefficient  of  .7  has  been  prepared.  What  will  the  reliability  be 
when  the  test  has  been  made  three  times  as  long  by  the  addition 
of similar material? The answer is found by substituting $n = 3$
and $r_{1I} = .7$ in equation (48), giving

$$\frac{3 \times .7}{1 + 2 \times .7} = .875.$$
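The substitution above can be repeated for any reliability and any lengthening factor. The following is a minimal sketch in Python of formula (48); the function name `spearman_brown` and its argument names are illustrative assumptions, not notation from the text.

```python
def spearman_brown(r_1I: float, n: float) -> float:
    """Predicted reliability of a test lengthened n-fold, formula (48).

    r_1I is the reliability of the original test (correlation between
    two comparable forms); n is the lengthening factor.
    """
    return n * r_1I / (1 + (n - 1) * r_1I)


# Worked example from the text: reliability .7, test made three times as long.
print(round(spearman_brown(0.7, 3), 3))
```

Setting n = 1 returns the original reliability unchanged, which is a quick sanity check on the formula.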


An empirical check on this formula was made by Miss Blythe
Clayton* and the writer, with carefully graduated spelling
material. Seven equally difficult tests with parallel forms were
given and the results of actual pooling compared with those
predicted by the formula. The close agreement between observed
and theoretical values is shown in the following table.

* Karl J. Holzinger and Blythe Clayton, "Further Experiments in the Application
of Spearman's Prophecy Formula," Journal of Educational Psychology (May,
1925), Vol. XVI, pp. 289-299.

Table 34. Observed and Predicted Reliability Coefficients for
Successive Pools of n Equally Difficult Spelling Tests

Number of Tests     Observed Reliability     Theoretical Coefficient
Pooled = n          Coefficient              from Formula (48)

      1                  .743                      .743
      2                  .841                      .853
      3                  .906                      .897
      4                  .916                      .920
      5                  .941                      .936
      6                  .949                      .945
      7                  .955                      .953

The formula for the validity of n pooled tests may be obtained
by working out $r_{c(z_1+z_2+z_3+\cdots+z_n)}$.

For three such tests we shall have

$$r_{c(z_1+z_2+z_3)} = \frac{\sum cz_1 + \sum cz_2 + \sum cz_3}{N\,\sigma_c\,\sigma_{(z_1+z_2+z_3)}}.$$

Substituting the values for $\sum cz$ and $\sigma_{(z_1+z_2+z_3)}$, there results

$$r_{c(z_1+z_2+z_3)} = \frac{r_{cz_1} + r_{cz_2} + r_{cz_3}}{\sqrt{3 + 2r_{z_1z_2} + 2r_{z_1z_3} + 2r_{z_2z_3}}} \tag{49}$$

or

$$r_{c(z_1+z_2+z_3)} = \frac{3r_{cz}}{\sqrt{3 + 6r_{zz}}} \tag{50}$$

if the validity coefficients and correlations $r_{zz}$ are equal. By
induction it may now be shown that for n tests we have

$$r_{cn} = \frac{n\,r_{cz}}{\sqrt{n + n(n-1)\,r_{zz}}} \tag{51}$$
{Formula for predicting validity of lengthened tests}

A test with a reliability of .7 and a validity coefficient of
.6 would, upon being made three times as long, have a validity
coefficient of

$$\frac{3 \times .6}{\sqrt{3 + 6 \times .7}} = .67.$$
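Formulas (49) and (51) lend themselves to a short numerical check. The sketch below is only illustrative: the function names are assumptions, and `pooled_validity` extends the three-test pattern of formula (49) to any number of tests by supplying every pairwise intercorrelation, which is an assumption beyond what the text states.

```python
from math import sqrt


def pooled_validity(validities, intercorrelations):
    """Validity of a sum of standard scores, after the pattern of formula (49).

    validities: r_cz for each pooled test.
    intercorrelations: r_zz for every distinct pair of tests
    (three values for three tests, as in formula (49)).
    """
    n = len(validities)
    return sum(validities) / sqrt(n + 2 * sum(intercorrelations))


def lengthened_validity(r_cz, r_zz, n):
    """Formula (51): equal validities r_cz and equal intercorrelations r_zz."""
    return n * r_cz / sqrt(n + n * (n - 1) * r_zz)


# Worked example from the text: validity .6, reliability .7, three times as long.
print(round(lengthened_validity(0.6, 0.7, 3), 2))
# Same validities, but hypothetical low intercorrelations of .2:
print(round(pooled_validity([0.6, 0.6, 0.6], [0.2, 0.2, 0.2]), 2))
```

Lowering the intercorrelations while holding the validities fixed shrinks the denominator, so the pooled validity rises.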
An interesting agreement between psychological theory and
statistical analysis may be noted in connection with formula
(49). Suppose several tests are to be pooled for the measurement
of intelligence. Some psychologists would select tests
which correlate high with a criterion and low amongst themselves
so as to obtain not only those which are most valid but
also those which measure as wide a sampling of intellectual
abilities as possible. Psychological theory would then require
a pool of tests with high values $r_{cz_1}$, $r_{cz_2}$, $r_{cz_3}$, etc., and low values
$r_{z_1z_2}$, etc. Very fortunately, such a combination will produce a
high validity coefficient for the combined tests, as may be seen
from formula (49). High coefficients, $r_{cz}$, will give a large numerator,
and low coefficients, $r_{zz}$, will give a small denominator,
both acting in the same direction to produce a high validity
coefficient for the pooled tests.

Another interesting application of correlation occurs in connection
with the scoring of multiple-choice tests. The general
formula advocated to correct for the element of guessing is

$$S = R - \frac{W}{n-1} = R - CW, \tag{52}$$
{Multiple-response scoring formula}

where S is the score, R the number of right responses, W the
number of wrong responses, C a constant, and n the number of
choices. Thus if the examinee is to underline one of three suggested
answers, he would be scored by the formula $S = R - \frac{1}{2}W$.
Such complicated scoring methods may be avoided entirely
if all pupils be allowed to finish the test. In this case, if A = the
number of attempts, R + W = A = a constant. We may also
write S = R + C(R - A), or S = aR + b, where a and b are constants.
The correlation between S and R will now be perfect,
so that the number of "rights" furnishes as reliable and valid a
score as the full formula. The proof that $r_{SR} = +1.00$ is left as
an exercise for the student. (See Exercise 8.)
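The claim that S and R correlate perfectly when every pupil attempts all items can be illustrated numerically (this is an illustration, not the proof asked for in the exercise). In the sketch below the number of attempts and the list of right answers are hypothetical data, and `pearson_r` and `corrected_score` are illustrative helper names.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corrected_score(right, wrong, choices):
    """Formula (52): S = R - W / (n - 1)."""
    return right - wrong / (choices - 1)


attempts = 20                      # hypothetical: every pupil finishes all 20 items
rights = [8, 11, 14, 17, 20]       # hypothetical numbers of right responses
scores = [corrected_score(r, attempts - r, choices=3) for r in rights]

print(scores)
print(pearson_r(rights, scores))   # equals 1 up to floating-point rounding
```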
8. The Effect of Selection upon Correlation

If the correlation between two traits, $X_1$ and $X_2$, is given by
$r_{12}$ with a sample of N and selected values are chosen, reducing
the size of the sample to N - n, then the resulting correlation
$R_{12}$ will differ from that obtained for the unselected group.

Professor Pearson* has shown that if $\sigma_1$ denotes the variability
in $X_1$ before selection and $\Sigma_1$ the variability after selection,
then

$$R_{12} = \frac{\Sigma_1\,r_{12}}{\sqrt{\Sigma_1^2\,r_{12}^2 + \sigma_1^2\,(1 - r_{12}^2)}} \tag{53}$$
{Correlation after selection}

The correlation $R_{12}$ decreases with $\Sigma_1$, so that restricting the
range of $X_1$ lowers the original correlation.

As an example, let us assume that the correlation between two
traits is given by $r_{12} = .7$, and that values of $X_1$ are taken so
that $\sigma_1 = 10$ is reduced to $\Sigma_1 = 5$. Substituting these values
in equation (53), we find $R_{12} = .44$, which is considerably less
than the correlation before selection.
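The arithmetic of this example is easy to reproduce. The following minimal sketch evaluates formula (53) as reconstructed here; the function and argument names are assumptions chosen for readability.

```python
from math import sqrt


def correlation_after_selection(r12, sigma_before, sigma_after):
    """Formula (53): correlation after the range of X1 has been restricted.

    sigma_before is the standard deviation of X1 in the unselected group,
    sigma_after its standard deviation after selection.
    """
    numerator = sigma_after * r12
    denominator = sqrt(sigma_after ** 2 * r12 ** 2
                       + sigma_before ** 2 * (1 - r12 ** 2))
    return numerator / denominator


# Worked example from the text: r12 = .7, sigma reduced from 10 to 5.
print(round(correlation_after_selection(0.7, 10, 5), 2))
```

With no restriction (both standard deviations equal) the function returns the original correlation, as the formula requires.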

In case there is selection in both variables the adjustment
formulas† become very complicated. The beginning student
will do well to avoid problems involving such correction until
he is in a position to read the papers cited in footnotes below.

* Karl Pearson, "On the Influence of Natural Selection on the Variability and
Correlation of Organs," Philosophical Transactions of the Royal Society of London,
Series A, Vol. CC, p. 23.

† Karl Pearson, "On the Influence of Double Selection on Variation and Correlation
of Two Characters," Biometrika, Vol. VI (1908).

9. The Effect of Range of Talent upon Correlation

The magnitude of correlation coefficients clearly depends
upon the particular group studied. Thus, "to secure a reliability
coefficient of .40 from a group composed of children in a
single grade is probably indicative of greater, not less, reliability
than to secure a reliability coefficient of .90 from a group composed
of children from the second to the twelfth grades," as
shown by Professor Kelley.*

This difference in the value of the obtained correlation is due
to what Kelley calls "range of talent," and he has given a formula
(see equation 116) for adjusting coefficients for varying ranges of
talent. The proof of the formula, however, is open to some objections
and it is probably better, therefore, to compare correlation
coefficients only when they have been obtained from the same
group or from groups varying but slightly in range of talent.

As  a  general  caution  it  may  be  noted  that  it  is  not  safe  to 
compare  correlation  coefficients  of  any  sort  obtained  from 
groups  where  the  range  of  talent  or  other  conditioning  factors 
such  as  range  in  age  are  very  different  (see  Chapter  XV). 

EXERCISES 

1. Make correlation tables for Otis with Terman and for Chicago
with Terman tests from the data of Exercise 1, Chapter II. Use intervals
of 69.5-79.5 etc. for Otis and Terman, and 29.75-34.75 etc. for
Chicago. Work out the coefficients of correlation.

($r_{OT}$ = …, $r_{CT}$ = .681. Ans.)

2. Make a correlation table for the two spelling tests of Exercise 6,
Chapter II, using intervals of 5 units for both tests. Work out the
correlation coefficient. (r = .963. Ans.)

3. Compute the means of the columns and the means of the rows
from the table of Exercise 2, and plot them on graph paper. Calculate
the equations of the two regression lines and plot on the same
graph. Determine the two probable errors of estimate.

(A = 1.01B - 2.28 ± 4.04; B = .92A + 6.63 ± 3.85. Ans.)

4. Calculate the correlation coefficient, regression lines, and probable
errors of estimate for the table on page 174. Compute the
means of the columns and rows and plot with the regression lines as
in Exercise 3. The values of the constants are given below the table.

5.  Compute  the  correlation  coefficient  for  the  table  on  page  175. 

* T. L. Kelley, "The Reliability of Test Scores," Journal of Educational
Research, May, 1921, p. 374.

6. Work out the correlation coefficient for the first 25 pairs of
scores for the data of Exercise 2, using the method described in section
3. Compare the amount of arithmetic with that involved in the
use of the correlation table.

7. The experimental reliability coefficients found by lengthening a
spelling test from one to ten times the original value were: .850, .903,
.927, .946, .960, .970, .974, .976, .980, and .981. Calculate the corresponding
theoretical coefficients from the Spearman-Brown formula,
using $r_{1I} = .847$ and n = 1, 2, 3, ..., 10 successively. (Data furnished by
Professor G. M. Ruch.)

(.847, .917, .943, .957, .965, .971, .975, .978, .980, .982. Ans.)

8. Work out the proof for the exercise suggested at the end of section 7.

9. Prove that the correlation between aX + b and cY + d is the
same as the correlation between X and Y where a, b, c, and d are
constants.


CHAPTER  X 


NON-LINEAR  CORRELATION 


1.  The  Correlation  Ratio 


Fig. 43. An extreme case of non-linear correlation

As pointed out in Chapter IX, when the means of the arrays
do not lie fairly closely on a straight line, the regression is to
be regarded as non-linear. The correlation coefficient, which
measures the approach to functionality only when the traits
have a linear relationship, will give an understatement of the
degree of association present for such curvilinear trends and is
therefore inapplicable. An extreme case of this understatement
is illustrated in Fig. 43, where all the observations lie on a
half circle. The correlation as defined by approach to functionality
will be perfect, but the product-moment coefficient will give zero
as the amount of association. This may be readily verified by
noting that from the symmetry of the points, $\sum xy$ will equal zero.
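The half-circle case can be verified numerically. The sketch below places 21 hypothetical points at equal angular steps on the upper half of a unit circle and computes the product-moment coefficient; the point layout and the helper name `pearson_r` are assumptions made for illustration.

```python
from math import cos, sin, pi, sqrt


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical observations: 21 points spaced evenly on a half circle.
angles = [pi * k / 20 for k in range(21)]
xs = [cos(t) for t in angles]
ys = [sin(t) for t in angles]

r = pearson_r(xs, ys)
print(abs(r) < 1e-9)   # True: r vanishes although y is an exact function of x
```

Every point to the right of the center is mirrored by one to the left at the same height, so the products of deviations cancel in pairs.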

In order to measure the correlation for non-linear tables,
Professor Pearson has devised a coefficient known as the correlation
ratio. The meaning of this coefficient may be shown by
returning to formula (37) for the standard error of estimate
of y on x. Rearranging the terms in this formula, we have

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2} \tag{54}$$
{Correlation coefficient in ratio form}

where $S_y$ is the standard deviation of the differences $y - \hat{y}$, or
residuals from estimation by the regression line $\hat{y} = mx$. The
coefficient of correlation was derived on the assumption that
the means of the columns, $\bar{y}_x$, and the means of the rows, $\bar{x}_y$, lie
on their respective regression lines.

In case the means of the columns, $\bar{y}_x$, do not lie on a straight
line, the residuals $y - \hat{y}$ may be replaced by the differences
$y - \bar{y}_x$, whose standard deviation is denoted by $\sigma_{ay}$. The correlation
ratio for the means of the columns may then be defined as

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}} \tag{55}$$

and, for the means of the rows, as

$$\eta_{xy} = \sqrt{1 - \frac{\sigma_{ax}^2}{\sigma_x^2}} \tag{56}$$
{Correlation ratios, original form}


From Fig. 44 it is apparent that the differences $y - \bar{y}_x$ and
their standard deviation $\sigma_{ay}$ measure the extent to which the
points in the scatter diagram are concentrated about the
irregular regression curve. When all the points in the diagram
are located at the column means, the differences $y - \bar{y}_x$ and
$\sigma_{ay}$ will be zero, giving $\eta_{yx} = 1$; but when there is any scatter
in the arrays, $\sigma_{ay}$ will not be zero, and $\eta_{yx}$ will be less than 1.
The correlation ratio thus measures the approach of the data to
any single-valued function, while the correlation coefficient
indicates the closeness to linear functionality.

Fig. 44. Illustrating the correlation ratio

It is further evident that if the regression is linear, $y - \hat{y} = y - \bar{y}_x$
for all the columns, so that $S_y = \sigma_{ay}$ and $r = \eta_{yx}$. If the
regression is not linear, $\bar{y}_x = \hat{y} \pm d$, or $y - \bar{y}_x \pm d = y - \hat{y}$.
Squaring both members of this last expression, summing over
the whole table, and introducing $f_x$ as a symbol of operation,
we have

$$\sum f_x (y - \bar{y}_x)^2 \pm 2\sum f_x (y - \bar{y}_x)\,d + \sum f_x d^2 = \sum f_x (y - \hat{y})^2.$$

Since the middle term on the left is zero for each column, this
reduces to

$$\sigma_{ay}^2 + \sigma_d^2 = S_y^2. \tag{57}$$

By combining equations (54), (55), and (57), we finally obtain

$$\sigma_y^2\,(\eta_{yx}^2 - r^2) = \sigma_d^2 \tag{58}$$
{Relation between correlation coefficient and ratio}

This proves that $\eta_{yx}$ is always greater than or equal to r, since
$\sigma_d^2$ cannot be negative. The same reasoning might of course
be applied to $\eta_{xy}$.

From the above discussion it is apparent that the single
measure of association furnished by the correlation coefficient
may be replaced by the two correlation ratios, which are never
numerically less than r. The correlation coefficient fails to
measure the full amount of association in case the regression is
not linear, and should not be used unless the departure from
linearity is negligible (see section 4).
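Relation (58) can be checked on a small table. In the sketch below the four columns of observations are hypothetical, the regression line is the ordinary least-squares line through the means, and all variable names are illustrative assumptions.

```python
from math import sqrt

# Hypothetical scatter: each x value (column) with its y observations.
columns = {1: [2, 4], 2: [5, 7], 3: [6, 8], 4: [3, 5]}
pairs = [(x, y) for x, ys in columns.items() for y in ys]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

r = cov / sqrt(var_x * var_y)
slope = cov / var_x                      # regression of y on x

col_mean = {x: sum(ys) / len(ys) for x, ys in columns.items()}

# sigma_ay^2: scatter of the observations about their own column means.
var_ay = sum((y - col_mean[x]) ** 2 for x, y in pairs) / n
eta_sq = 1 - var_ay / var_y              # formula (55), squared

# sigma_d^2: departure of the column means from the regression line,
# weighted by column frequency (one term per observation).
var_d = sum((col_mean[x] - (mean_y + slope * (x - mean_x))) ** 2
            for x, _ in pairs) / n

print(round(var_y * (eta_sq - r ** 2), 6), round(var_d, 6))   # both sides of (58)
```

For these data the column means rise and then fall, so the ratio is far larger than the coefficient.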


2.  Modified  Formulas  for  the  Correlation  Ratios 

Formulas which are more convenient for computation may be
obtained by modifying equations (55) and (56) and introducing
the methods and notation of Chapter IX. The quantity $y - \bar{y}_x$,
when squared and summed over a column, gives

$$\sum{}'(y - \bar{y}_x)^2 = \sum{}' y^2 - 2\bar{y}_x\sum{}' y + \sum{}' \bar{y}_x^2 = \sum{}' y^2 - f_x\,\bar{y}_x^2$$

where the primes denote summation over a column.
Summing next over the whole table, we find

$$\sum (y - \bar{y}_x)^2 = \sum y^2 - \sum f_x\,\bar{y}_x^2.$$

Denoting the standard deviation of $\bar{y}_x$ by $\sigma_{\bar{y}_x}$, we then have

$$\sigma_{ay}^2 = \sigma_y^2 - \sigma_{\bar{y}_x}^2.$$

If this result is substituted in equation (55), we obtain

$$\eta_{yx} = \frac{\sigma_{\bar{y}_x}}{\sigma_y} \tag{60}$$

and, similarly,

$$\eta_{xy} = \frac{\sigma_{\bar{x}_y}}{\sigma_x} \tag{61}$$
{Correlation ratios as quotients of two standard deviations}

X 


The correlation ratio is thus the quotient of the standard
deviation of the means of the arrays divided by the standard
deviation of the whole table. It should be noted that in forming
$\sigma_{\bar{y}_x}$, the deviations are weighted by the frequencies of the arrays.
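The agreement between the original form (55) and the quotient form (60) can be confirmed on a small example. The sketch below uses hypothetical columns with unequal frequencies, so that the weighting by array frequencies matters; the function name is an assumption.

```python
from math import sqrt

# Hypothetical columns with unequal frequencies: x value -> y observations.
columns = {10: [3, 5, 4], 20: [6, 8], 30: [9, 7, 8, 8]}


def correlation_ratio_two_ways(columns):
    """Return eta_yx computed by formula (55) and by formula (60)."""
    ys = [y for col in columns.values() for y in col]
    n = len(ys)
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n

    # Formula (55): scatter of observations about their own column means.
    var_ay = sum((y - sum(col) / len(col)) ** 2
                 for col in columns.values() for y in col) / n
    eta_55 = sqrt(1 - var_ay / var_y)

    # Formula (60): column means about the grand mean, weighted by f_x.
    var_means = sum(len(col) * (sum(col) / len(col) - mean_y) ** 2
                    for col in columns.values()) / n
    eta_60 = sqrt(var_means / var_y)
    return eta_55, eta_60


eta_55, eta_60 = correlation_ratio_two_ways(columns)
print(abs(eta_55 - eta_60) < 1e-9)   # True: the two forms agree
```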

We shall next modify formulas (60) and (61) so that the calculations
may be carried out with the variables taken from
arbitrary origins as in the formulas of Chapter IX.

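The first formula may be written

$$\eta_{yx} = \frac{1}{\sigma_y}\sqrt{\frac{\sum f_x(\bar{Y}_x - M_y)^2}{N}} \tag{62}$$
{Correlation ratio for means of columns}

where $M_y = A_y + \frac{\sum f_y d_y}{N}\,k$

and $\bar{Y}_x = A_y + \frac{\sum' f_{xy} d_y}{f_x}\,k$.

We therefore have

$$\bar{Y}_x - M_y = k\left(\frac{\sum' f_{xy} d_y}{f_x} - \frac{\sum f_y d_y}{N}\right)$$

and

$$(\bar{Y}_x - M_y)^2 = k^2\left[\frac{(\sum' f_{xy} d_y)^2}{f_x^2} - \frac{2\sum' f_{xy} d_y \sum f_y d_y}{N f_x} + \frac{(\sum f_y d_y)^2}{N^2}\right].$$

Summing over the columns and then over the whole table,
we find

$$\sum f_x(\bar{Y}_x - M_y)^2 = k^2\left[\sum\frac{(\sum' f_{xy} d_y)^2}{f_x} - \frac{2(\sum f_y d_y)^2}{N} + \frac{(\sum f_y d_y)^2}{N}\right]$$

since $\sum\sum' f_{xy} d_y = \sum f_y d_y$.

Upon collecting terms we may write

$$\sum f_x(\bar{Y}_x - M_y)^2 = k^2\left[\sum\frac{(\sum' f_{xy} d_y)^2}{f_x} - \frac{(\sum f_y d_y)^2}{N}\right]$$

and, similarly,

$$\sum f_y(\bar{X}_y - M_x)^2 = h^2\left[\sum\frac{(\sum' f_{xy} d_x)^2}{f_y} - \frac{(\sum f_x d_x)^2}{N}\right].$$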

It  has  already  been  shown  in  Chapter  IX  that 

Substituting  these  results  in  formula  (62),  we  have 

$$\eta_{yx} = \ldots \tag{63}$$

and, similarly,

$$\eta_{xy} = \ldots \tag{64}$$
{Correlation ratios for correlation blank}

These last formulas are especially convenient when the correlation
coefficient and ratios are to be compared as is often necessary.
The quantities b and c and the corrections for d and e are
obtained in working out the correlation coefficient. The remainder
of the computation for d and e may be readily done
with the aid of the special correlation sheet shown in Table 35.

3. A Combination Form for the Correlation Coefficient and Ratios

In  Table  35  the  full  computation  of  the  correlation  ratios  is 
shown.  It  will  be  noted  that  this  form  differs  from  the  one 
shown  in  Table  31  in  that  two  additional  columns  and  rows  for 
the  calculation  of  the  ratios  are  given. 

For the items on the row headed $(\sum' f_{xy} d_y)^2$ it is only necessary
to square, by means of tables, the quantities $(\sum' f_{xy} d_y)$
already found in calculating r. In the last row these squared
items are divided by the corresponding frequencies $f_x$, the total
sum being $\sum\frac{(\sum' f_{xy} d_y)^2}{f_x}$. The correction to this last quantity
is $\frac{(\sum f_y d_y)^2}{N}$, also known from previous work. Making the
correction, we obtain e = 10,827, and by a similar calculation for
the rows we determine d = 11,817. The remainder of the computation
may be readily done with the aid of logarithms as
illustrated on the sheet.

It will be noted that the computation for $\sum f_{xy} d_x d_y$ has been
checked by working out this quantity from both the columns
and rows. Other important checks, such as $\sum[\sum' f_{xy} d_y] = \sum f_y d_y$,
should be noted on the sheet and carefully observed in the calculations.

The value for $\eta_{yx}$ comes out as .6191, while $\eta_{xy}$ is .6401. The
former ratio is in close agreement with the correlation coefficient,
r = .6120, on account of the linearity of the means of the
columns. The coefficient $\eta_{xy}$, however, is somewhat larger than
r because of the irregular regression curve for the rows.

In case the means of the arrays are required for plotting they
may be readily found by use of the formulas

$$\bar{X}_y = A_x + \left(\frac{\sum' f_{xy} d_x}{f_y}\right)h \tag{65}$$

and

$$\bar{Y}_x = A_y + \left(\frac{\sum' f_{xy} d_y}{f_x}\right)k \tag{66}$$
{Means of the arrays in a correlation table}

where $A_x$ and $A_y$ are the assumed means and h and k the class
intervals for the variables X and Y, respectively. The quantities
$\sum' f_{xy} d_x$ and $\sum' f_{xy} d_y$ may be taken directly from the correlation
sheet. For the means of the columns, we should thus have

$\bar{Y}_{80.5} = 2.667 + \ldots = 1.75$,

$\bar{Y}_{81.5} = 2.667 + \ldots = 1.54$,

$\bar{Y}_{82.5} = 2.667 + \ldots = 1.66$, etc.
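Formula (66) amounts to the familiar short method of finding a mean from an assumed origin, applied one column at a time. The sketch below uses a single hypothetical column; the class midpoints, frequencies, assumed mean, and the function name `array_mean` are all illustrative assumptions and are not taken from Table 35.

```python
def array_mean(assumed_mean, interval, freqs, steps):
    """Mean of one array by formula (66): A + (sum of f*d / sum of f) * k.

    freqs: cell frequencies f_xy down the column.
    steps: deviations d_y of each class from the assumed mean, in class intervals.
    """
    total = sum(freqs)
    return assumed_mean + sum(f * d for f, d in zip(freqs, steps)) / total * interval


# Hypothetical column: Y classes centered at 1.0, 1.5, 2.0, 2.5, 3.0 (k = .5),
# assumed mean A_y = 2.0, so the step deviations run from -2 to +2.
midpoints = [1.0, 1.5, 2.0, 2.5, 3.0]
freqs = [1, 3, 4, 2, 0]
steps = [-2, -1, 0, 1, 2]

short_method = array_mean(2.0, 0.5, freqs, steps)
direct = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
print(short_method, direct)   # the two computations agree
```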
4. Tests for Linearity

In order to determine whether or not a correlation table is
sufficiently linear so that r may replace $\eta$ as a measure of association,
a test for linearity known as Blakeman's test may be
applied. The observed difference between $\eta_{yx}$ and r for the
data on page 182 is .6191 - .6120 = .0071. With a similar table
it appears quite likely that this difference might be zero.

The tests of Blakeman which we shall use here may be expressed
in the following form: The difference between $\eta$ and r is
to be regarded as insignificant, provided that

$$\eta^2 - r^2 < \frac{4.047}{\sqrt{N}}\sqrt{(\eta^2 - r^2)\{(1 - \eta^2)^2 - (1 - r^2)^2 + 1\}}, \tag{67}$$
{Blakeman's test for linearity}

or, if $\eta^2 - r^2$ is small in comparison with r,

$$\sqrt{N}\,\sqrt{\eta^2 - r^2} < 4.047. \tag{68}$$
{Blakeman's short test for linearity}

A full discussion of such sampling tests will be found in Chapter
XIII, but for the present we shall merely illustrate the above
rules by applying them to the university and high-school correlations
found in the preceding section.

For the coefficients $\eta_{yx}$ and r we have, upon substituting the
necessary values in formula (67),

$$.00874 < \frac{4.047}{41.32}\sqrt{.00874\{.989\}} = .00911.$$

Using formula (68), we have

$$41.32 \times .0935 = 3.86 < 4.047.$$

By both tests, therefore, the regression is to be regarded as
linear and the use of a linear equation for predicting university
from high-school grades is justified.

Applying formula (68) to $\eta_{xy}$ and r, we find

$$41.32 \times .1876 = 7.75 > 4.047.$$

The regression in this case is non-linear, and for a full measure
of the association, $\eta_{xy}$ must be used.
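The two tests are simple to mechanize. The sketch below implements formulas (67) and (68) as reconstructed here and reruns the figures of the text; the function names are assumptions, and N is taken as 41.32 squared because the text reports only the square root of N.

```python
from math import sqrt


def blakeman_full(eta, r, n):
    """Formula (67): True when the eta-r difference is insignificant (linear)."""
    zeta = eta ** 2 - r ** 2
    bound = 4.047 / sqrt(n) * sqrt(zeta * ((1 - eta ** 2) ** 2
                                            - (1 - r ** 2) ** 2 + 1))
    return zeta < bound


def blakeman_short_statistic(eta, r, n):
    """Left member of formula (68); regression is taken as linear if below 4.047."""
    return sqrt(n) * sqrt(eta ** 2 - r ** 2)


n = 41.32 ** 2   # assumption: the text gives only the square root of N, 41.32
r = 0.6120

for label, eta in (("eta_yx", 0.6191), ("eta_xy", 0.6401)):
    stat = blakeman_short_statistic(eta, r, n)
    print(label, round(stat, 2), "linear" if stat < 4.047 else "non-linear",
          blakeman_full(eta, r, n))
```

Both functions assume that eta is not smaller than r in absolute value, as relation (58) guarantees for exact computation.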
For a small number of cases (say 50) it is frequently impossible
to determine with certainty whether or not the regression
is linear. With small tables, therefore, unless the regression is
obviously curved, the calculation of r will be all that is required.
With considerably larger bodies of data, however, the test becomes
important and the use of r should be justified by comparison
with both of the correlation ratios.

5. A Method of Eliminating the Effect of a Variable
upon the Association between Two Others

If three or more correlated variables are involved, the association
between two of them for a fixed value of the third is
often required. In case the regressions are all linear throughout,
the problem may be solved by the use of multiple correlation
as shown in Chapter XV, but with non-linear relationships the
solution becomes more difficult.

Fig. 45. Illustrating formula (69)

The most direct and the best method for dealing with such
problems is to correct the two associated variables for values of
the third. The method may be illustrated for two variables, both
having non-linear correlation with age. Designating these as
X, Y, and A, the correlation $r_{xy}$ for A eliminated is required.

It is first necessary to prepare the correlation tables for X
with A and Y with A and determine the regression curves for
Y on A and X on A as shown in Fig. 45. These curves may be
drawn in free-hand, or fitted by the method of least squares as
shown in Chapter XVI. A certain age, $A_s$, is then selected for both
tables, and all the values of X and of Y are corrected to this age.
Let the ordinate of any observation in the table at age $A_t$
be denoted by $Y_t$, and let $\bar{Y}_t$ and $\bar{Y}_s$ be the mean values of Y
at $A_t$ and $A_s$ furnished by the regression curve. Then the required
value of $Y_s$ at $A_s$ will be given by the relation

$$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t). \tag{69}$$

From Fig. 45 it will be noted that this formula merely assumes
that the growth in $Y_t$ from $A_t$ to $A_s$ is parallel to the
regression curve between these points. The corrected
variable $Y_s$ is thus the most likely value that $Y_t$ will
have when the individual has reached the standard age.

The arithmetic is most easily done by preparing a
table of the regression-curve values $\bar{Y}$ for all
ages and then applying formula (69) to the observations
at each age across the correlation table. Similar corrections
may be made for the variable X, and all the results recorded
on the tabulation cards. The correlation between the corrected
values $Y_s$ and $X_s$ then gives a good approximation to the
result that would have been obtained if all the subjects had
been measured at the same age, $A_s$.

In  case  the  standard  deviations  of  the  arrays  of  ages  are  not 
equal,  another  correction  may  be  made.  Equal  variability  of  the 
arrays  across  the  table  is  described  as  homoscedasticity,  and 
unequal  variability  as  heteroscedasticity.  The  new  correction, 
then,  is  for  heteroscedasticity  as  illustrated  in  Fig.  46. 
Fig. 46. Illustrating formula (70)

If an individual at $A_t$ is one standard deviation above the
mean $\bar{Y}_t$, his most probable deviation at age $A_s$ will be one standard
deviation above the mean at standard age. Denoting the
standard deviations of the arrays at $A_t$ and $A_s$ by $\sigma_t$ and $\sigma_s$,
respectively, the corrective formula then becomes*

$$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t)\,\frac{\sigma_s}{\sigma_t} \tag{70}$$
{Corrective formula adjusting for age and heteroscedasticity}

In applying this formula it is necessary to work out the ratios
$\frac{\sigma_s}{\sigma_t}$ for each age, multiply the result by the corrective factor
$(Y_t - \bar{Y}_t)$, and add to the value for $\bar{Y}_s$.
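Formulas (69) and (70) can be expressed as one small function. The sketch below is a minimal illustration: the function name, argument names, and the numerical values (an observation of 52 at an age where the curve gives 48, corrected to a standard age where the curve gives 60) are hypothetical.

```python
def correct_to_standard_age(y_t, mean_t, mean_s, sigma_t=None, sigma_s=None):
    """Correct an observation made at age A_t to the standard age A_s.

    mean_t, mean_s: regression-curve values of Y at A_t and A_s.
    sigma_t, sigma_s: array standard deviations at A_t and A_s; when both are
    supplied formula (70) is used, otherwise formula (69).
    """
    if sigma_t is None or sigma_s is None:
        return mean_s + (y_t - mean_t)                    # formula (69)
    return mean_s + (y_t - mean_t) * sigma_s / sigma_t    # formula (70)


# Hypothetical figures for one observation.
print(correct_to_standard_age(52, 48, 60))                          # formula (69)
print(correct_to_standard_age(52, 48, 60, sigma_t=8, sigma_s=10))   # formula (70)
```

Under formula (70) the corrected value keeps the same standard score at the standard age that the observation had at its own age: here (52 - 48)/8 and (65 - 60)/10 are both .5.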

The corrected values $X_s$ and $Y_s$ may now be correlated, and
the result will give the relationship between these variables for
the age eliminated. This is essentially what is known as a partial
correlation between X and Y (for A fixed). In case the variables
X and Y have linear regression with A, a partial correlation may
be worked out by the use of a formula (see Chapter XV).

It may finally be noted that the regression curves for the corrected
variables $X_s$ and $Y_s$ may be non-linear and the correlation
ratio required. Whatever measure of relationship is used, however,
the resulting association is freed from the effect of the
third variable A.

EXERCISES 

1. Work out the correlation coefficient and the two correlation
ratios for the table on page 187. Apply the tests for linearity.

(r = -.828; correlation ratios .961 and .958. Ans.)

2 and 3. Calculate the correlation coefficient and ratios for the
tables on pages 188 and 189, and test for linearity.

4. Show that the method for correction given by formula (70) is
equivalent to equating standard scores at ages $A_t$ and $A_s$.

* For an illustration of the use of this formula see a paper by the author, "On the
Relation of Vital Capacity to Certain Psychical Characters," Biometrika, Vol. XVI,
p. 139.
CHAPTER XI

THE  BINOMIAL  DISTRIBUTION 

1.  Introductory 

As  pointed  out  in  the  first  chapter,  the  inductive  side  of  statis- 
tical method  is  based  on  the  theory  of  probabihty.  The  com- 
parison of  results  from  different  samples,  inferences  regarding 
differences,  and  generalizations  of  various  sorts  are  possible 
only  by  resorting  to  the  theory  of  chance. 

So  important  is  this  aspect  of  statistical  science  that  some 
writers  *  devote  practically  all  of  their  treatment  to  the  theory 
of  probability.  For  an  elementary  course  and  for  the  non- 
mathematical  student  such  extensive  treatment  is  impossible. 
We  shall  therefore  be  content  to  present  here  some  of  the 
simplest  ideas  in  this  theory  with  the  understanding  that  the 
student  is  urged  to  amplify  his  knowledge  of  probability  by 
consulting such works as Keynes,† Whittaker,‡ and Fisher.

In  the  present  chapter  we  shall  take  up  certain  elementary 
theorems  in  probability  and  discuss  the  chance  distribution 
known  as  the  point  binomial.  Certain  properties  of  this  series 
which  are  important  in  the  theory  of  sampling  will  also  be  con- 
sidered. The  binomial  law  also  serves  as  a  good  introduction 
for  the  normal  probability  curve,  which  will  be  taken  up  in  the 
following  chapter. 

In  order  to  remind  the  student  of  some  of  the  algebra  useful 
in the development of the point binomial we shall turn first to
the  theory  of  combinations. 

*  Arne  Fisher,  The  Mathematical  Theory  of  Probabilities.  The  Macmillan  Com- 
pany, second  edition,  1923. 

† J. M. Keynes, A Treatise on Probability. The Macmillan Company, 1921.

‡ Whittaker and Robinson, The Calculus of Observations. D. Van Nostrand
Company,  1924. 
2. Permutations and Combinations

Suppose that a group of n objects is given. Any set of r* of
these objects, without regard to their order, is called a combination
of the n objects taken r at a time, and is denoted by the
symbol ${}_nC_r$. For example, the combinations of the first four
letters of the alphabet taken three at a time are

abc  abd  acd  bcd

Since there are four of these, we may write ${}_4C_3 = 4$.

If the order of the objects be taken into account, the arrangements
are known as permutations and are denoted by ${}_nP_r$. Thus
the letters a, b, and c may be arranged in a row in the order
abc, acb, bac, bca, cab, and cba, so that ${}_3P_3 = 6$. In the case of
four letters, each of the four combinations of three furnishes six
permutations, so that the total number of permutations of four
things taken three at a time is twenty-four, or ${}_4P_3 = 24$.

The  general  formulas  for  permutations  and  combinations  may 
be  shown  to  have  the  forms 

$${}_nP_r = n(n-1)(n-2)\cdots(n-r+1) \tag{71}$$

and

$${}_nC_r = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!}. \tag{72}$$

The quantity r! is known as "factorial r" and means the
product of all integers from 1 to r.

It is also shown in algebra that ${}_nC_r = {}_nC_{n-r}$, so that
${}_nC_n = {}_nC_0 = 1$. This theorem will be needed in a later section.

Applying  the  above  formulas  to  four  letters  taken  two  at  a 
time,  we  find 

${}_4P_2 = 4 \times 3 = 12$,    and    ${}_4C_2 = \frac{4 \times 3}{1 \times 2} = 6$.

* This " r " should not be confused with the correlation coefficient. It seemed best
to  retain  this  symbol  in  the  theory  of  combinations  because  of  its  wide  use  by 
mathematicians. 
These results may be easily verified by making all possible arrangements
of four letters two at a time,

ab  ac  ad  bc  bd  cd

ba  ca  da  cb  db  dc

The  student  may  also  check  the  following  numerical  results 
by  applying  the  above  formulas : 

${}_6P_2 = 30$        ${}_6C_2 = 15$        ${}_{10}P_3 = 720$        ${}_{10}C_3 = 120$
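The student with access to a computer may check formulas (71) and (72) mechanically. The following is a minimal sketch in Python; the function names n_P_r and n_C_r are chosen here for illustration and are not part of the text. It also lists the arrangements of four letters two at a time.

```python
import math
from itertools import combinations, permutations

def n_P_r(n, r):
    """Formula (71): the product n(n-1)(n-2)...(n-r+1)."""
    result = 1
    for factor in range(n, n - r, -1):
        result *= factor
    return result

def n_C_r(n, r):
    """Formula (72): the permutations divided by factorial r."""
    return n_P_r(n, r) // math.factorial(r)

letters = "abcd"
unordered_pairs = ["".join(c) for c in combinations(letters, 2)]
ordered_pairs = ["".join(p) for p in permutations(letters, 2)]

print(unordered_pairs, len(unordered_pairs))   # the combinations, 4C2 of them
print(ordered_pairs, len(ordered_pairs))       # the permutations, 4P2 of them
print(n_P_r(6, 2), n_C_r(6, 2), n_P_r(10, 3), n_C_r(10, 3))
```

The counts obtained by listing agree with the values given by the formulas, and the standard library functions math.perm and math.comb return the same numbers.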

3.  Elementary  Probability 

If an event may happen in h ways and fail in k ways, and if
each of the h + k ways is equally likely to occur, the mathematical
probability * of the event happening is

$$p = \frac{h}{h+k} \tag{73}$$

and the probability of its failing is

$$q = \frac{k}{h+k}. \tag{74}$$


It  is  evident  that  the  probability  of  an  event  happening  plus 
the  probability  of  its  failing  is  equal  to  1,  which  is  the  mathe- 
matical symbol  for  certainty.  The  above  results  may  also  be 
expressed by saying that the odds are h to k in favor of the
event  happening,  or  k  to  h  against  its  occurrence. 

Some  of  the  simplest  examples  of  such  probability  are  fur- 
nished by  the  results  of  penny  and  dice  tossing.  Let  us  assume 
that  the  penny  is  a  homogeneous  disk  and  exclude  the  possibility 
of  its  standing  upon  an  edge  or  sticking  in  a  crack.  If  the  turn- 
ing up  of  the  head  is  regarded  as  a  successful  event  and  the 
turning up of the tail as a failure, it is evident that $p = q = \tfrac{1}{2}$. In
the  case  of  the  die,  the  turning  up  of  the  ace  might  be  considered 

* We are not concerned here with the various types of probability discussed in
such  treatises  as  Keynes,  op.  cit. 
a success. Since there are six equally likely ways for the die to
fall, the probability of the successful occurrence of this event is
$\tfrac{1}{6}$, and that of its failure is $\tfrac{5}{6}$. It should also be noted that the
odds  are  even  of  turning  up  a  head  with  a  coin,  but  are  five  to 
one  against  the  turning  up  of  an  ace  on  a  die. 

If  several  pennies  or  dice  are  used,  the  resulting  tosses  are 
considered  as  compound  events,  and  the  occurrences  of  the  indi- 
vidual events  are  regarded  as  entirely  independent  of  one  another. 
Thus  with  two  pennies,  a  toss  resulting  in  two  heads  is  a  com- 
pound event,  and  the  fall  of  one  penny  is  not  influenced  in  any 
way  by  the  fall  of  the  other. 

It  should  be  observed  that  the  same  results  will  be  obtained 
whether  we  deal  with  the  occurrences  of  a  number  of  similar 
events,  or  with  several  trials  of  the  same  event.  This  may  be 
illustrated  in  the  case  of  penny-tossing.  The  various  tosses 
which  occur  with  three  coins  are  the  same  combinations  that 
arise  when  one  penny  is  tossed  three  times  in  succession  and  the 
individual  occurrences  then  combined.  A  compound  event  may 
therefore  be  obtained  by  several  trials  of  a  single  event. 

The  probability  for  the  occurrence  of  a  compound  event  such 
as  all  heads  on  three  successive  trials  with  a  penny  (or  from  one 
toss  of  three  pennies)  may  be  obtained  by  applying  the  defini- 
tion of  probability  given  above.  The  number  of  equally  likely 
ways  in  which  the  coin  may  fall  on  the  first  trial  is  2,  and  on 
each  of  the  other  two  trials  also  2,  so  that  the  total  number  of 
equally  likely  possible  ways  for  the  compound  event  to  occur  is 
2x2x2  =  8.  The  number  of  favorable  ways  for  the  event  to 
happen is clearly 1, so that the required probability is $\tfrac{1}{8}$. By
similar reasoning it may be shown that if the probability of an
event is p, the probability of its occurrence on all of n trials is $p^n$.
In case we are dealing with a number of dissimilar independent
events whose individual probabilities are $p_1, p_2, p_3, \cdots$, the
probability of their all occurring together is $p_1 \times p_2 \times p_3 \cdots$.

This  last  theorem  may  be  illustrated  in  the  case  of  a  penny, 
a  die,  and  a  deck  of  playing  cards.    The  probability  of  turning 
up a head on the penny, an ace on the die, and the king of
spades from the deck on one trial for each is $\tfrac{1}{2} \times \tfrac{1}{6} \times \tfrac{1}{52} = \tfrac{1}{624}$.
The  probabilities  of  a  complete  set  of  compound  events  may 
be  illustrated  by  examining  the  combinations  which  occur  when 
three  coins  are  being  tossed.  If  the  coins  are  designated  by  1,  2, 
and  3,  while  H  stands  for  head  and  T  for  tail,  the  following 
arrangement  of  the  eight  different  throws  may  be  made : 


(1)    T    T  T  H    T  H  H    H
(2)    T    T  H  T    H  T  H    H
(3)    T    H  T  T    H  H  T    H

Of the 8 equally likely combinations, one is TTT, or all tails,
while another is TTH, or two tails and a head. This latter
compound event may occur in 3 different ways, however, so
that the probability of its occurrence is $\tfrac{3}{8}$. A complete set of
such probabilities may then be set down as follows:

Probability of TTT = $\tfrac{1}{8}$
Probability of TTH = $\tfrac{3}{8}$
Probability of THH = $\tfrac{3}{8}$
Probability of HHH = $\tfrac{1}{8}$

A general expression for the above results may now be obtained
by using the theorem for the probability of compound
events. The probability that an event will occur on all of
n trials is evidently $p^n$. In the above problem this is $(\tfrac{1}{2})^3 = \tfrac{1}{8}$.
The probability that the event will occur n - 1 times and fail
once is $p^{n-1}q$. This result, however, may occur in n different
ways, as is evident from the illustrative problem. The complete
probability for n - 1 successes and one failure is therefore
$np^{n-1}q$. Next, the probability that in n trials the event will
occur n - 2 times and fail twice is $p^{n-2}q^2$, and again, this may
occur in the number of ways in which two things may be selected
from n, which is $\frac{n(n-1)}{1 \cdot 2} = {}_nC_2$. The total probability is
therefore ${}_nC_2\,p^{n-2}q^2$. Thus for three trials the probability that
there will be one success and two failures is
$\frac{3 \cdot 2}{1 \cdot 2}\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)^2 = \tfrac{3}{8}$
in this example.

Continuing in the same way it is evident that the general
expression for the probability of obtaining exactly r successes
and (n - r) failures is given by ${}_nC_r\,p^rq^{n-r}$.
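The general expression for r successes in n trials may be checked against a direct count of the equally likely throws. The sketch below does this for three coins in Python; the helper name binomial_term is an illustrative choice, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_term(n, r, p):
    """Probability of exactly r successes in n trials: nCr * p^r * q^(n-r)."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

n = 3
p = Fraction(1, 2)

# Every equally likely throw of three coins, e.g. ('H', 'T', 'T').
throws = list(product("HT", repeat=n))

# Probability of r heads by simple counting of favorable throws.
counted = {
    r: Fraction(sum(1 for t in throws if t.count("H") == r), len(throws))
    for r in range(n + 1)
}

# The same probabilities from the general expression.
formula = {r: binomial_term(n, r, p) for r in range(n + 1)}

for r in range(n + 1):
    print(r, counted[r], formula[r])
```

The two columns printed agree for every r, which is the content of the argument above for the case of three trials.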


4.  The  Binomial  Theorem  and  the  Point  Binomial 
The binomial theorem may be written in the form

$$(a+b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{1 \cdot 2}a^{n-2}b^2 + \cdots + b^n. \tag{75}$$

The expansion on the right is the general result of multiplying
out (a + b)(a + b)(a + b) ⋯ to n factors. By making use of
the notation for combinations, a more convenient form of this
expansion may be obtained:

$$(a+b)^n = {}_nC_0a^n + {}_nC_1a^{n-1}b + {}_nC_2a^{n-2}b^2 + {}_nC_3a^{n-3}b^3 + \cdots + {}_nC_nb^n. \tag{76}$$

Applying this theorem to $(q+p)^n$, we have

$$(q+p)^n = {}_nC_0q^n + {}_nC_1q^{n-1}p + {}_nC_2q^{n-2}p^2 + {}_nC_3q^{n-3}p^3 + \cdots + {}_nC_np^n, \qquad \text{(Point binomial)} \tag{77}$$

the terms of which agree with the general expression for the
probability of r successes found in the preceding section. The
conclusion then is that if n trials be made of an event for which
the probability of occurrence is p and the probability of failure is q,
the probabilities of 0, 1, 2, ⋯ n successes are given by the successive
terms in the expansion of the binomial $(q+p)^n$.

As an illustration of this theorem the thirteen terms of the
binomial $(\tfrac{1}{2}+\tfrac{1}{2})^{12}$ are worked out in Table 36 on page 196.
These are the probabilities of getting 0, 1, 2, ⋯ 12 heads when
one coin is tossed twelve times or twelve coins are tossed once.

It  is  apparent  from  these  results  that  the  probability  of  getting 
all  heads  or  all  tails  is  very  small.  If  twelve  coins  were  used,  only 
about  once  in  4000  throws  would  such  an  event  occur. 
Table 36. Illustrating the Probabilities of Obtaining 0, 1, 2 ...
Heads in Tossing Twelve Coins

Successes (Heads)      Probabilities
                       Fraction       Decimal
   0                      1/4096      .000244
   1                     12/4096      .002930
   2                     66/4096      .016113
   3                    220/4096      .053711
   4                    495/4096      .120850
   5                    792/4096      .193359
   6                    924/4096      .225586
   7                    792/4096      .193359
   8                    495/4096      .120850
   9                    220/4096      .053711
  10                     66/4096      .016113
  11                     12/4096      .002930
  12                      1/4096      .000244

Total                                1.000000
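The entries of Table 36 can be regenerated from the terms of the point binomial. The following Python sketch assumes nothing beyond n = 12 and p = q = 1/2; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n = 12
p = q = Fraction(1, 2)

# Successive terms nCr * q^(n-r) * p^r for r = 0, 1, ..., 12.
terms = [comb(n, r) * q**(n - r) * p**r for r in range(n + 1)]

for r, term in enumerate(terms):
    # comb(n, r) is the numerator over the common denominator 2^12 = 4096.
    print(f"{r:2d}   {comb(n, r):4d}/4096   {float(term):.6f}")

print("Total", float(sum(terms)))
```

Because exact fractions are used, the total is exactly unity, as required of a complete set of mutually exclusive events.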


The expression $(q+p)^n$ is often called the point binomial,
since its expansion is represented by a series of isolated points.
In Fig. 47 these points have been connected by straight lines,
forming a polygon very much like the normal curve in general
appearance (see Chapter XII).

Fig. 47. Plot of the binomial $(\tfrac{1}{2}+\tfrac{1}{2})^{12}$

It has already been proved that the probability of a specified
number of successes is given by the appropriate term in the point
binomial. Another important result is that the probability of
an event occurring r or more times in n trials is the sum of the
terms in the expansion of $(q+p)^n$ from ${}_nC_r\,q^{n-r}p^r$ to ${}_nC_n\,p^n$
inclusive. This follows from the fact that the n + 1 compound events
are mutually exclusive, or such that the occurrence of one combination
excludes, for that throw, the other n possible arrangements.
As shown in algebra, the probability that some one or
other of such mutually exclusive events will occur is the sum of
the probabilities of the separate (here compound) events.

Thus if twelve coins are thrown the probability of obtaining
nine or more heads on a single toss is the sum of the probabilities
$\frac{220}{4096} + \frac{66}{4096} + \frac{12}{4096} + \frac{1}{4096} = \frac{299}{4096}$, or .073. This result may
also be worked out by noting that of the 4096 equally likely
arrangements of the 12 coins there are 220 ways in which nine
heads may turn up, 66 ways in which ten heads may occur, 12
ways for eleven heads to appear, and 1 way in which twelve
heads may be obtained. This gives a total of 299 ways in which
at least nine heads may appear, and the probability for such
an occurrence is $\frac{299}{4096}$ from the definition of simple probability.
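The rule for "r or more successes" amounts to summing the upper terms of the expansion. A minimal Python sketch follows; the function name prob_at_least is introduced here for illustration only.

```python
from fractions import Fraction
from math import comb

def prob_at_least(r, n, p):
    """Sum of the terms of (q + p)^n from r successes up to n successes."""
    q = 1 - p
    return sum(comb(n, k) * q**(n - k) * p**k for k in range(r, n + 1))

# Nine or more heads with twelve coins.
tail = prob_at_least(9, 12, Fraction(1, 2))

# The same figure by counting favorable arrangements among the 4096.
favorable_ways = sum(comb(12, k) for k in range(9, 13))

print(tail, float(tail), favorable_ways)
```

Both routes lead to 299 favorable arrangements out of 4096, as in the text.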


5.  The  Mean  of  the  Point  Binomial  and  its 
Standard  Deviation 

We  shall  next  prove  two  interesting  theorems  in  connection 
with  the  point  binomial.  These  are  known  as  the  theorems  of 
Bernoulli  and  are  of  great  importance  in  statistical  theory. 

The mean of the point binomial is np, and its standard deviation
is $\sqrt{npq}$.

In proving the theorems, M and σ are calculated as follows:

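Table 37. Calculation of M and σ for the Point Binomial

Successes   Frequency $f$                                      $d$   $fd$                                        $fd^2$
0           $q^n$                                              0     0                                           0
1           $nq^{n-1}p$                                        1     $nq^{n-1}p$                                 $nq^{n-1}p$
2           $\frac{n(n-1)}{1 \cdot 2}q^{n-2}p^2$               2     $n(n-1)q^{n-2}p^2$                          $2n(n-1)q^{n-2}p^2$
3           $\frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}q^{n-3}p^3$  3     $\frac{n(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^3$   $\frac{3n(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^3$
...
Totals      1                                                        $np$                                        $np[1+p(n-1)]$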
The sum of the frequencies is $(q+p)^n$, or unity, and $\Sigma fd$ may be
readily factored into

$$\Sigma fd = np\left[q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^2 + \cdots\right] = np(q+p)^{n-1} = np.$$

We may now apply the formula for the mean, $M = A + \left(\frac{\Sigma fd}{N}\right)h$.

Since $A = 0$, $N = 1$, $\Sigma fd = np$, and $h = 1$, the mean of the
binomial becomes

$$M = np. \qquad \text{(Mean of the point binomial)} \tag{78}$$

In order to obtain the standard deviation it is necessary to
find $\Sigma fd^2$ for the above series. The last column of items in
Table 37 may be factored as follows:

$$\Sigma fd^2 = np\left[q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^2 + \cdots\right].$$

The terms in the brackets may now be broken up to form two
series in (q + p). Thus,

$$\begin{aligned}
\Sigma fd^2 &= np\left[\left\{q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^2 + \cdots\right\} + \left\{(n-1)q^{n-2}p + \frac{2(n-1)(n-2)}{1 \cdot 2}q^{n-3}p^2 + \cdots\right\}\right] \\
&= np\left[(q+p)^{n-1} + (n-1)p\left\{q^{n-2} + (n-2)q^{n-3}p + \cdots\right\}\right] \\
&= np\left[(q+p)^{n-1} + (n-1)p(q+p)^{n-2}\right] \\
&= np\left[1 + (n-1)p\right].
\end{aligned}$$

Applying formula (17) for standard deviation, we find that

$$\sigma = \sqrt{np[1+(n-1)p] - n^2p^2} = \sqrt{np(1-p)},$$

or

$$\sigma = \sqrt{npq}. \qquad \text{(Standard deviation of the point binomial)} \tag{79}$$

The  above  formulas  make  possible  the  complete  description  of 
certain  distributions  given  by  chance.  The  terms  in  the  series  fur- 
nish the  ordinates  of  the  curve,  while  the  mean  and  the  standard 
deviation  from  the  formulas  (78)  and  (79)  are  convenient  meas- 
ures of  the  central  tendency  and  dispersion  of  such  a  distribution. 
For the binomial $(\tfrac{1}{2}+\tfrac{1}{2})^{12}$ the mean and the standard deviation
by these formulas work out at 6 and $\sqrt{3}$ respectively. In
the case of such a symmetrical series, the mean is of course
obtained by inspection.

If twelve dice are thrown and the turning up of an ace is
considered a success, the probabilities of 0, 1, 2 ⋯ 12 successes
are given by the terms in the expansion of $(\tfrac{5}{6}+\tfrac{1}{6})^{12}$.
This series is distinctly skew, but the mean and the standard
deviation are readily found to be 2 and $\sqrt{5/3}$ on applying
formulas (78) and (79). Practical evidence of the convenience of
these formulas may be obtained by working out the same
results directly from the frequencies.
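Working out the same results directly from the frequencies is tedious by hand but short by machine. The sketch below, in Python with illustrative function names, computes the mean and variance of the full expansion and compares them with np and npq for the two binomials just mentioned.

```python
from fractions import Fraction
from math import comb, sqrt

def point_binomial(n, p):
    """Terms of (q + p)^n, indexed by the number of successes."""
    q = 1 - p
    return [comb(n, r) * q**(n - r) * p**r for r in range(n + 1)]

def mean_and_variance(freqs):
    """Mean and variance taken directly from a list of frequencies."""
    total = sum(freqs)
    mean = sum(d * f for d, f in enumerate(freqs)) / total
    second_moment = sum(d * d * f for d, f in enumerate(freqs)) / total
    return mean, second_moment - mean**2

n = 12
for p in (Fraction(1, 2), Fraction(1, 6)):
    mean, variance = mean_and_variance(point_binomial(n, p))
    # Formulas (78) and (79) give np and npq for comparison.
    print(p, mean, n * p, variance, n * p * (1 - p), sqrt(variance))
```

For each value of p the directly computed mean equals np and the directly computed variance equals npq, so that the standard deviation is the square root of npq.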

6.  Experimental  Verification  of  the  Binomial  Law 

In  order  to  see  whether  or  not  the  actual  results  of  penny  and 
dice  tossing  come  out  as  predicted  by  the  above  formulas,  it  will 
be  interesting  to  cite  one  or  two  examples.  While  such  experi- 
ments serve  to  verify  in  a  rough  way  the  properties  of  the  point 
binomial,  it  should  be  noted  that  strictly  speaking  they  are  not 
verifications  at  all  because  the  conditions  implied  in  the  for- 
mulas can  never  be  met  on  actual  trial.  The  perfectly  homo- 
geneous penny  or  die  does  not  exist,  nor  is  it  possible  to  make 
the  tosses  so  that  certain  throws  are  not  favored  over  certain 
others.  Differences  between  the  observed  trials  and  the  theo- 
retically correct  results  will  then  be  due  not  only  to  the  number 
of  trials  or  size  of  the  sample  but  to  imperfections  in  the 
objects  thrown,  and  to  faulty  methods  in  tossing  them.  The 
student  is  urged,  however,  to  make  a  few  personal  experiments 
such  as  those  quoted  below  in  order  that  he  may  become  more 
familiar  with  the  meaning  and  practical  utility  of  the  bino- 
mial law. 

In  the  following  experiment  twelve  dice  were  thrown  4096 
times,  the  method  being  to  roll  them  down  an  inclined  gutter 
of  corrugated  paper.    A  throw  of  4,  5,  or  6  was  considered  a 
success, so that $p = q = \tfrac{1}{2}$. The theoretical mean will then be
np, or 6, and the standard deviation $\sqrt{npq}$, or 1.732. The following
table gives the observed and theoretical frequencies.

Table 38. Observed and Theoretical Frequencies of 0, 1, 2 ...
Successes from the Tossing of Twelve Dice with Throws of Four,
Five, or Six as Successes

Successes     Observed Frequency     Theoretical Frequency
   0                    0                       1
   1                    7                      12
   2                   60                      66
   3                  198                     220
   4                  430                     495
   5                  731                     792
   6                  948                     924
   7                  847                     792
   8                  536                     495
   9                  257                     220
  10                   71                      66
  11                   11                      12
  12                    0                       1

Total                4096                    4096

The  mean  of  the  observed  distribution  is  6.139  and  its  stand- 
ard deviation  is  1.712.  The  actual  proportion  of  successes  is 
0.512  instead  of  0.5.  The  agreement,  on  the  whole,  is  there- 
fore rather  good. 
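The observed mean and standard deviation quoted above may be recomputed from the frequencies of Table 38. The Python sketch below reads the blank entries for 0 and 12 successes as zero, an assumption which the total of 4096 supports.

```python
from math import sqrt

# Observed frequencies for 0, 1, ..., 12 successes (Table 38);
# the entries for 0 and 12 are taken as zero (assumption, see above).
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]

total = sum(observed)
mean = sum(d * f for d, f in enumerate(observed)) / total
second_moment = sum(d * d * f for d, f in enumerate(observed)) / total
sd = sqrt(second_moment - mean**2)

# Each throw uses twelve dice, so the proportion of successes is mean / 12.
proportion = mean / 12

print(total, round(mean, 3), round(sd, 3), round(proportion, 3))
```

The figures so obtained may be set beside the theoretical mean 6, standard deviation 1.732, and proportion 0.5.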

In the next experiment a throw of a 6 was considered a success,
so that $p = \tfrac{1}{6}$ and $q = \tfrac{5}{6}$. The theoretical mean is 2 and the
standard  deviation  is  1.291.  The  observed  frequency  distribu- 
tion was  as  follows : 

Table 39. Observed Frequencies of 0, 1, 2 ... Successes resulting
from the Throws of Twelve Dice with the Turning of a Six as
a Success

Successes     Frequency
   0              447
   1             1145
   2             1181
   3              796
   4              380
   5              115
   6               24
   7                7
   8                1

Total            4096

The observed mean is 2.000 and standard deviation 1.296, while
the actual proportion of successes is .1667, agreeing with the
theoretical values to an extent that is probably accidental.

The above results show that with careful extensive experiments
such as these, the observed series is in good agreement
with the binomial expansion.

7.  The  Binomial  Applied  to  Statistical  Data 

In  the  case  of  frequency  distributions  of  observed  data  af- 
fected by  many  factors,  the  point  binomial  might  often  be  used 
were  it  not  for  the  large  number  of  terms  involved,  and  the 
difficulty  of  replacing  the  mathematical  probability,  known 
a  priori,  by  an  empirical  probability  ratio  furnished  by  the  data. 

As  an  illustration  we  may  take  the  records  of  400  candidates 
for  the  master's  degree  in  a  certain  university.  Among  other 
requirements  it  was  necessary  for  the  candidate  to  have  an 
average  of  B—  or  better.  For  the  present  purposes,  such  an 
average  may  be  considered  a  success,  and  a  lower  average  may 
be  regarded  as  a  failure.  Out  of  400  candidates  331  maintained 
a  satisfactory  average,  so  that  the  empirical  probability  of  such 
a success is $\tfrac{331}{400} = .8275$. It should be noted that such a ratio
might  change  considerably  from  time  to  time,  and  would  also 
tend  to  be  unstable  when  applied  to  small  numbers.  We  cannot 
expect,  therefore,  to  get  as  good  results  from  such  empirical 
ratios  as  from  the  probabilities  in  the  case  of  penny-tossing. 

The  average  number  of  candidates  coming  up  at  one  time  was 
about  ten.  Taking  this  number  as  the  size  of  the  sample  (cor- 
responding to  the  number  of  coins  tossed)  the  point  binomial 
$(.1725 + .8275)^{10}$ might be used to determine the probability
for  any  number  of  successes,  say  nine  or  more. 

The  terms  in  this  binomial  (computed  by  logarithms)  to- 
gether with  the  results  actually  found  by  trial  are  given  in 
the  table  on  page  202.  The  probability  of  getting  9  or  more 
successes  in  a  sample  of  10  is  the  sum  of  the  probabilities  .314 
and  .150,  or  .464.  The  expected  number  from  400  candidates 
will,  therefore,  be  400  x  .464,  or  186.  This  result  happens  to  be 
in close agreement with the observed number, (6 + 13) × 10 = 190.
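The terms of this binomial, which the text computes by logarithms, may also be obtained directly. The following Python sketch assumes only the figures already given (331 successes among 400 candidates, samples of ten); since the printed values come from logarithm tables, the last decimal place may differ slightly.

```python
from math import comb

n = 10                 # candidates in one sample
p = 331 / 400          # empirical probability of a satisfactory average
q = 1 - p

# Probability of exactly r successful candidates out of ten.
probs = {r: comb(n, r) * q**(n - r) * p**r for r in range(n + 1)}

p_nine_or_more = probs[9] + probs[10]
expected_candidates = 400 * p_nine_or_more     # expected number among 400
theoretical_freqs = {r: 40 * probs[r] for r in probs}   # per 40 samples

for r in range(10, 2, -1):
    print(r, round(probs[r], 3), round(theoretical_freqs[r], 1))
print(round(p_nine_or_more, 3), round(expected_candidates))
```

Multiplying each probability by 40, the number of samples of ten, gives the theoretical frequencies with which the observed counts are compared.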
Table 40. Observed and Theoretical Frequencies for the Number
of Successful Candidates for the Master's Degree, in Samples
of Ten, the Total Number of Candidates being 400

Successful Candidates     Observed      Theoretical     Probabilities
out of Ten                Frequency     Frequency
  10                          6             6.0             .150
   9                         13            12.6             .314
   8                         12            11.8             .294
   7                          7             6.5             .164
   6                          0             2.4             .060
   5                          1             0.6             .015
   4                          1             0.1             .003
   3                          0             0.0             .000

Total                        40            40              1.000

Fig. 48. Comparison of theoretical and observed frequencies for
candidate data (horizontal axis: successful candidates out of ten,
from 10 down to 4; vertical scale: 0 to 13)


The complete set of theoretical frequencies is found by multiplying
the probability values by 40. These frequencies agree fairly well
with those given by the data as shown in the above table and in
Fig. 48.

Further evidence of the agreement of the two series may be found
by comparing the theoretical and observed standard deviations. The
former is $\sqrt{npq}$, or 1.19, while the latter is 1.28. The difference,
or 0.09, may be readily accounted for by chance fluctuations in
sampling (see formula (91) and the testing of differences in
Chapter XIII).
EXERCISES

1.  Expand  the  following  binomials,  and  plot  the  results. 

a  +  iy,  a  +  hy.  a  +  ir>  c-i  +  -9)^  (-i  +  -^y". 

2. If the terms in the expansion of $(\tfrac{1}{2}+\tfrac{1}{2})^{10}$ represent the
probabilities of 0, 1, 2, 3 ... 10 successes, find the probability of obtaining
seven or more successes in ten trials. ($\tfrac{176}{1024} = .172$.  Ans.)

3.  Find  the  means  and  standard  deviations  for  the  binomials  of 
Exercise  1,  using  formulas  (78)  and  (79).  Verify  some  of  the  answers 
by  direct  calculation  from  the  full  expansions  of  the  binomials. 

4. From Table 41 of Chapter XII determine the empirical probability
of a man selected at random being over $71\tfrac{15}{16}$ inches in
height. Use the total distribution. (.039. Ans.) What is the probability
that a man's height will be between $66\tfrac{15}{16}$ and $67\tfrac{15}{16}$ inches?
(.155. Ans.) What is the probability that a man's height will be
greater than $72\tfrac{15}{16}$ inches or less than $62\tfrac{15}{16}$ inches? (.052. Ans.)

5. Suppose that a penny is tossed, a die thrown, and a card
drawn from an ordinary deck. What is the probability of the combined
event: head on the coin, ace or six on the die, and a heart
on the card, with a single trial for each? ($\tfrac{1}{24}$.  Ans.)

6. What is the probability of turning up a total of eight with
two dice? ($\tfrac{5}{36}$.  Ans.)

7. If three cards are drawn from a suit of thirteen cards, what is
the chance that both king and queen are drawn? ($\tfrac{1}{26}$.  Ans.)

8.  Show  that  if  np  be  a  whole  number,  the  mean  of  the  binomial 
coincides  with  the  greatest  term. 

9. Derive formulas (78) and (79) by differentiating the expression
$(q + px)^n$ with respect to x and setting x = 1.


CHAPTER  XII 

THE  NORMAL  PROBABILITY  CURVE 

1.  Introductory 

In  the  present  chapter  we  shall  discuss  the  properties  and 
uses  of  the  normal  probability  curve,  the  general  form  of  which 
is  doubtless  already  familiar  to  the  student  (see  Fig.  51). 

An example of a distribution resembling the normal probability
curve is furnished by the mental age data in Fig. 49. When
these data are separated into "normals" and "defectives" two
fairly symmetrical curves result. Burt* explains the lack of
complete symmetry in the curve for normals on the ground
that the Binet Scale lacks adequate tests for the brighter children
of the older ages. He concludes that even though his data
are somewhat irregular, they do not "in any way contradict the
hypothesis of 'normality,' the theory that ability is distributed
in close conformity with the normal curve of error."

In the case of certain physical characteristics such as height,
the normal curve appears to give an excellent fit to the observations.
The data in Table 41, quoted from Yule,† furnish a very
good example.

The  histogram  for  the  frequencies  in  the  total  column  of  the 
table  is  shown  in  Fig.  50,  where  the  symmetry  and  general 
resemblance  to  the  normal  curve  are  apparent. 

The  above  examples  suggest  that  the  frequency  distributions 
of  some  mental  and  physical  traits  conform  fairly  well  to  the 
normal  curve.  It  would  be  far  from  correct,  however,  to  assume 
that  all  human  characteristics  are  normally  distributed.  This 
assumption  was  made  by  an  early  statistician  named  Quetelet. 

* Cyril Burt, Mental and Scholastic Tests, p. 162. King and Son, Ltd., London,
1921.

† Yule, Introduction to Statistics, p. 88.

Fig. 49. Distribution according to general intelligence of children of ordinary
elementary and special M.D. schools (curves for Normals and Mental Defectives;
horizontal axis: deviation from average, -4 to +4 S.D.)

From "Mental and Scholastic Tests," by Cyril Burt. Courtesy of
P. S. King and Son, Ltd.
Fig. 50. Histogram for heights of 8585 men
Table 41. Distribution of Stature for Adult Males Born
in the British Isles


Height  in  Inches 

Number  of  Men  according  to  Birthplace 

Total 

(Without  Shoes) 

England 

1 
1 

9 

16 

48 

117 

254 

473 

753 

886 

918 

881 

740 

524 

320 

128 

70 

39 

12 

3 

1 

Scotland 

Wales 

Ireland 

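76 15/16 - 77 15/16
75 15/16 - 76 15/16
74 15/16 - 75 15/16
73 15/16 - 74 15/16
72 15/16 - 73 15/16
71 15/16 - 72 15/16
70 15/16 - 71 15/16
69 15/16 - 70 15/16
68 15/16 - 69 15/16
67 15/16 - 68 15/16
66 15/16 - 67 15/16
65 15/16 - 66 15/16
64 15/16 - 65 15/16
63 15/16 - 64 15/16
62 15/16 - 63 15/16
61 15/16 - 62 15/16
60 15/16 - 61 15/16
59 15/16 - 60 15/16
58 15/16 - 59 15/16
57 15/16 - 58 15/16
56 15/16 - 57 15/16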

1 

4 

6 

15 

26 

69 

102 

115 

218 

210 

210 

139 

109 

47 

19 

9 

2 

2 

1 

1 

1 

2 

6 

21 

33 

52 

72 

128 

145 

108 

83 

48 

30 

9 

1 

1 

3 

10 

15 

25 

40 

62 

73 

58 

33 

15 

7 

2 

2 

1 

2 

5 

16 

32 

79 

202 

392 

646 

1063 

1230 

1329 

1223 

990 

669 

394 

169 

83 

41 

14 

4 

2 

Total 

6194 

1304 

741 

346 

8585 

He  pictured  an  average  man  with  physical  and  social  traits  at 
the  means  of  a  series  of  probability  curves.  The  work  of  such 
men  as  Pearson  and  Charlier,  however,  has  since  shown  that 
these  characteristics  are  best  represented  by  a  variety  of  curves 
among which the probability curve is a special type. (See sections
8 and 9 of Chapter XVI.)

It  will  be  shown  in  section  5  that  the  resemblance  of  a  fre- 
quency distribution  to  the  normal  curve  cannot  be  satisfactorily 
determined  by  mere  inspection  of  the  data.  A  rigorous  test  of 
the normality of a given distribution involves the superposition
of  a  normal  curve  on  the  data  and  a  mathematical  comparison  of 
the  observed  and  theoretical  frequency. 
The normal probability curve is very important in the field of
educational  measurements  because  of  its  usefulness  in  scale  con- 
struction and  in  many  calculations  involving  qualitative  series. 
It  is  usually  necessary  in  such  problems  to  assume  some  form 
of  distribution  and  the  normal  curve  is  taken  because,  of  all  the 
curves  which  might  be  employed,  it  gives  the  best  single  ap- 
proximation to  the  ordinary  test  score  distribution.  The  mathe- 
matical properties  of  the  probability  curve,  including  tabulations 
of  its  integral  and  ordinate,  make  the  calculations  involved  very 
much  simpler  than  with  some  skew  form  of  curve. 

Although  no  formal  derivation  of  the  normal  curve  will  be 
given,  its  relation  to  the  point  binomial  will  be  shown  as  well  as 
its  usefulness  in  the  elementary  theory  of  probability. 


2.  The  Equation  of  the  Normal  Probability  Curve 

As already pointed out, the practical use of the point binomial
requires a great deal of labor. If, for example, the samples
in the problem of section 7, Chapter XI, had consisted of twenty
instead of ten candidates, the terms in the binomial $(q + p)^{20}$
would have to be computed.

An important simplification of the binomial law may now be
reached by allowing the size of n to increase indefinitely. It is
obvious from the binomials discussed thus far that as n becomes
larger the resulting polygon over the n + 1 points becomes
smoother and tends to spread out more and more in both
directions from the mean. The limit to the point binomial, $(q + p)^n$,
as n increases indefinitely, may be shown by mathematical proof*
to be given by the continuous curve

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)} \tag{80}$$
(Normal curve† with area = 1)


* Yule, Introduction to Statistics, p. 301 (simple proof).

† The normal probability curve was first given by De Moivre in 1733 but was
later rediscovered by Laplace and Gauss.
where e = 2.7183 ···, which is the base of the Napierian system
of logarithms, and π is the familiar ratio of the circumference
of a circle to its diameter.

Just as the sum of the ordinates in the point binomial $(q + p)^n$
is equal to unity, so the area under this curve is equal to 1.

From equation (80) it is evident that for x = 0,
$y = \frac{1}{\sigma\sqrt{2\pi}}$, and that about the value 0, which is
the mean, the curve is symmetrical, because the same positive and
negative values of x give a single value for y. By writing the
equation in the form

$$y = \frac{1}{\sigma\sqrt{2\pi}\; e^{x^2/(2\sigma^2)}} \tag{81}$$

it is also apparent that no matter how large or small x is taken,
y will never become equal to zero. The curve is thus symmetrical
about the mean at x = 0, and extends indefinitely in both
directions, approaching the x-axis as an asymptote as shown in Fig. 51.

Fig. 51. Normal curve with unit area (if σ = 1)

In case the normal curve is applied to data for which the total
frequency is N and not unity, the form of the equation becomes

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)} = y_0\, e^{-x^2/(2\sigma^2)} \tag{82}$$
(Normal curve with area = N)

each of the ordinates for unit area being multiplied by N. The
coefficient $\frac{N}{\sigma\sqrt{2\pi}}$ is often designated as $y_0$,
and is the maximum ordinate at x = 0, since $e^0 = 1$ as noted in
Chapter IV, section 4.
3. The Area, Ordinates, and Deviates of the
Normal Curve

If the standard deviation of the normal curve be chosen as 1,
for convenience, the equation then takes the form

$$z = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \tag{83}$$
(Ordinate of the normal curve, with unit area and standard deviation)

The values of x, or number of standard deviations from the
mean, are called deviates; z is the usual symbol for the ordinate
at a given deviate; and ½α will be used to denote the area from
the mean to such a deviate. These three functions of the curve
have been computed and tabled in various ways, and are of the
greatest importance for a variety of statistical calculations. An
illustration of these functions is given in Fig. 52, the numbers
being taken from Table 42. It will be noted that for a deviate
x = 1.5, the ordinate z will have the value .130, while the area
from the mean, or ½α, will be 43.3 per cent of the total area of
the curve.

The methods for calculating the areas and deviates are a
part of the calculus, but the ordinates may be determined by
merely substituting various values for x in equation (83). For
example, when x = 0, $z = \frac{1}{2.5066} = .3989$. Similarly, when
x = 1, $z = \frac{1}{2.5066}(2.7183)^{-1/2} = .2420$.

Fig. 52. Illustrating area, ordinates, and deviates for a normal curve
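The substitution into equation (83) can be checked mechanically. The following is a minimal sketch in Python; the function name `normal_ordinate` is an illustrative choice and does not come from the text.

```python
import math

def normal_ordinate(x: float) -> float:
    """Ordinate z of the unit normal curve, equation (83)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Reproduce the worked values in the text (x = 0 and x = 1),
# plus the deviate x = 1.5 used in Fig. 52.
for x in (0.0, 1.0, 1.5):
    print(f"x = {x:.1f}   z = {normal_ordinate(x):.4f}")
```

The printed values agree with the text's .3989 and .2420, and with the three-place ordinate .130 quoted for x = 1.5.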
For certain problems, which will be taken up later, it has
been found convenient to calculate and table these functions in
two ways:

(1)  Areas  and  ordinates  for  given  deviates,  and 

(2)  Deviates  and  ordinates  for  given  areas. 

Complete tables for these values are found in Pearson's*
"Tables for Statisticians and Biometricians," in Kelley's†
"Statistical Method," and in more abbreviated form in a
handbook prepared by the writer.‡ Two short lists of three-place
values are also given in Tables 42 and 43.

It is apparent that α is the area from -x to +x, as shown in the
accompanying figure. When x = ±1, α = .682, from which it follows
that more than two thirds of the total area under the curve is
included between these limits. When x = ±3, α = .998, showing
that a range of 6σ includes more than 99 per cent of the
frequency. It will also be noted that the ordinate at x = 3 is very
small, being only 4/399 of $y_0$, or .01 of the maximum ordinate at
the mean.
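The area α between -x and +x can also be computed directly, without tables. The sketch below assumes the standard identity α = erf(x/√2) for the unit normal curve; the name `central_area` is hypothetical.

```python
import math

def central_area(x: float) -> float:
    """Area alpha under the unit normal curve between -x and +x."""
    return math.erf(x / math.sqrt(2.0))

for x in (1.0, 2.0, 3.0):
    print(f"x = ±{x:.0f}   alpha = {central_area(x):.4f}")
```

Computed this way, α is about .6827 at x = ±1 and .9973 at x = ±3. The three-place figures .682 and .998 in the text come from doubling the rounded entries of Table 42, so they may differ by a unit in the last place.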

Table 43 for deviates and ordinates in terms of area from the
mean shows that for equal increments of ½α there is very little
change in x and z in the neighborhood of the mean, but very
rapid change toward the extremities of the curve. For ½α = .50
the ordinate is equal to zero, and the deviate is infinite.

* Tables for Statisticians and Biometricians, edited by Karl Pearson. Cambridge
University Press, England. Second edition, 1924.

† T. L. Kelley, Statistical Method. The Macmillan Company, 1923.

‡ Karl J. Holzinger, Statistical Tables for Students in Education and
Psychology. The University of Chicago Press, 1925.


Fig. 53. Illustrating α for a normal curve (axis marked at x = -1, x = 0, x = +1)
Table 42. Areas and Ordinates for Given Deviates from the Mean

   x      ½α      z    |    x      ½α      z
  0.0    .000    .399  |   2.1    .482    .044
  0.1    .040    .397  |   2.2    .486    .035
  0.2    .079    .391  |   2.3    .489    .028
  0.3    .118    .381  |   2.4    .492    .022
  0.4    .155    .368  |   2.5    .494    .018
  0.5    .191    .352  |   2.6    .495    .014
  0.6    .226    .333  |   2.7    .497    .010
  0.7    .258    .312  |   2.8    .497    .008
  0.8    .288    .290  |   2.9    .498    .006
  0.9    .316    .266  |   3.0    .499    .004
  1.0    .341    .242  |   3.1    .499    .003
  1.1    .364    .218  |   3.2    .499    .002
  1.2    .385    .194  |   3.3    .500    .002
  1.3    .403    .171  |   3.4    .500    .001
  1.4    .419    .150  |   3.5    .500    .001
  1.5    .433    .130  |   3.6    .500    .001
  1.6    .445    .111  |   3.7    .500    .000
  1.7    .455    .094  |   3.8    .500    .000
  1.8    .464    .079  |   3.9    .500    .000
  1.9    .471    .066  |   4.0    .500    .000
  2.0    .477    .054  |   4.1    .500    .000

When ½α = .25 it will be noted that x = .674, or, more
exactly, x = .6744898. This value, which is known as the
probable error, is therefore given by the relation

$$\text{P.E.} = .6744898\,\sigma \tag{84}$$
(Relation between P.E. and σ)

It is very frequently used as a unit of measurement on the
normal scale instead of σ, chiefly because of long usage.

It may also be observed that P.E. and Q are the same for a
normal curve, since exactly half of the area is included when they
are laid off on either side of the mean. With actual data, P.E.
will not be equal to Q, so that it is usually better to avoid the
use of the term probable error in describing an observed
frequency distribution. The term arose in connection with
distributions of error such as those in astronomical measurements.
With ordinary data such a deviate does not represent an error
and the term probable error is therefore a misnomer.
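The constant in formula (84) is simply the deviate that cuts off one quarter of the area on each side of the mean. A brief sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the printed tables (the function name `probable_error` is illustrative):

```python
from statistics import NormalDist

UNIT = NormalDist()  # unit normal curve: mean 0, standard deviation 1

def probable_error(sigma: float) -> float:
    """P.E. for a normal curve with standard deviation sigma, formula (84)."""
    # The deviate with half-area .25 leaves .75 of the area to its left.
    return UNIT.inv_cdf(0.75) * sigma

pe = probable_error(1.0)
area_within = UNIT.cdf(pe) - UNIT.cdf(-pe)
print(f"P.E. = {pe:.7f} sigma, area within ±P.E. = {area_within:.3f}")
```

The second printed figure confirms that exactly half of the area lies within ±1 P.E. of the mean.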
Table 43. Deviates and Ordinates for Given Area from the Mean

  ½α      x       z    |   ½α      x       z
  .00   0.000   .399   |   .26   0.706   .311
  .01   0.025   .399   |   .27   0.739   .304
  .02   0.050   .398   |   .28   0.772   .296
  .03   0.075   .398   |   .29   0.806   .288
  .04   0.100   .397   |   .30   0.842   .280
  .05   0.126   .396   |   .31   0.878   .271
  .06   0.151   .394   |   .32   0.915   .262
  .07   0.176   .393   |   .33   0.954   .253
  .08   0.202   .391   |   .34   0.994   .243
  .09   0.228   .389   |   .35   1.036   .233
  .10   0.253   .386   |   .36   1.080   .223
  .11   0.279   .384   |   .37   1.126   .212
  .12   0.305   .381   |   .38   1.175   .200
  .13   0.332   .378   |   .39   1.227   .188
  .14   0.358   .374   |   .40   1.282   .175
  .15   0.385   .370   |   .41   1.341   .162
  .16   0.412   .366   |   .42   1.405   .149
  .17   0.440   .362   |   .43   1.476   .134
  .18   0.468   .358   |   .44   1.555   .119
  .19   0.496   .353   |   .45   1.645   .103
  .20   0.524   .348   |   .46   1.751   .086
  .21   0.553   .342   |   .47   1.881   .068
  .22   0.583   .337   |   .48   2.054   .048
  .23   0.613   .331   |   .49   2.326   .027
  .24   0.643   .324   |   .50     ∞     .000
  .25   0.674   .318   |

4.  Comparison  of  the  Point  Binomial  and  the 
Normal  Curve 

The close agreement between the binomial series and the
normal curve may be illustrated for the binomial
$(\tfrac{1}{2} + \tfrac{1}{2})^{16}$, the ordinates for which are given
by expansion as shown in Chapter XI.

In order to compute the normal ordinates at the 17 binomial
points it is first necessary to calculate the values of the latter as
deviates from the mean. Since the standard deviation of the
binomial is $\sqrt{npq}$, or 2, for the above series, the deviates at
0, 1, 2, 3, ··· successes will be

$$\frac{0-8}{2} = -4, \quad \frac{1-8}{2} = -3.5, \quad \frac{2-8}{2} = -3, \text{ etc.}$$
-  =  -  4,  ^^  =  -  3.5, 
The ordinates of the normal curve for these deviates may
now be looked up in Table 42, and divided by 2 in order to
make them comparable with the binomial ordinates. The tabled
values, of course, are for unit standard deviation. A complete
list of the abscissas and ordinates for both curves may then be
obtained as shown in Table 44.


Table 44. Ordinates for the Binomial (½ + ½)^16, with Corresponding
Normal Ordinates

  Successes   Binomial     x/σ     Normal Ordinates   Normal Ordinates
              Ordinates            for σ = 1          for σ = 2
      0        .000       -4.0       .000               .000
      1        .000       -3.5       .001               .0005
      2        .002       -3.0       .004               .002
      3        .0085      -2.5       .018               .009
      4        .028       -2.0       .054               .027
      5        .067       -1.5       .130               .065
      6        .122       -1.0       .242               .121
      7        .1745      -0.5       .352               .176
      8        .196        0.0       .399               .199
      9        .1745      +0.5       .352               .176
     10        .122       +1.0       .242               .121
     11        .067       +1.5       .130               .065
     12        .028       +2.0       .054               .027
     13        .0085      +2.5       .018               .009
     14        .002       +3.0       .004               .002
     15        .000       +3.5       .001               .0005
     16        .000       +4.0       .000               .000

    Total     1.000                                     1.000

From  these  values  and  by  inspection  of  Fig.  54  it  is  apparent 
that  the  agreement  between  the  two  curves  is  very  close.  For 
more  terms,  of  course,  the  discrepancies  between  the  ordinates 
would  have  been  even  less  than  those  found  here. 
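The comparison in Table 44 can be reproduced with a few lines of code. The sketch below assumes n = 16 and p = ½ as in the text, and uses exact binomial coefficients rather than tabled values, so the last digit may differ slightly from the table; the function names are illustrative.

```python
import math

N_TRIALS, P = 16, 0.5
MEAN = N_TRIALS * P                              # 8 successes
SIGMA = math.sqrt(N_TRIALS * P * (1.0 - P))      # sqrt(npq) = 2

def binomial_ordinate(k: int) -> float:
    """Term of the point binomial (q + p)^n for k successes."""
    return math.comb(N_TRIALS, k) * P**k * (1.0 - P) ** (N_TRIALS - k)

def normal_ordinate(k: int) -> float:
    """Ordinate of the normal curve with the same mean and sigma, equation (80)."""
    d = (k - MEAN) / SIGMA
    return math.exp(-d * d / 2.0) / (SIGMA * math.sqrt(2.0 * math.pi))

rows = [(k, binomial_ordinate(k), normal_ordinate(k)) for k in range(N_TRIALS + 1)]
for k, b, n in rows:
    print(f"{k:2d}  binomial = {b:.4f}  normal = {n:.4f}  diff = {b - n:+.4f}")
```

The largest discrepancy occurs at the mean (8 successes) and is roughly .003, which is consistent with the close agreement described above.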

The equation of the normal curve here considered is clearly

$$y = \frac{1}{\sqrt{8\pi}}\, e^{-x^2/8},$$

since it is only necessary to substitute σ = 2 in equation (80).
The mean of the curve is set at 8 successes.
Fig. 54. The point binomial (½ + ½)^16 compared with the normal curve
$y = \frac{1}{\sqrt{8\pi}}\, e^{-x^2/8}$ (horizontal axis: successes, 0 to 16)


5.  Fitting  a  Normal  Curve  to  a  Frequency 
Distribution  of  Data 

The method for fitting a normal curve to a series of
observations is similar to that just described, with the exception that
areas and not ordinates are to be compared in determining the
goodness of fit. The superposed curve is determined by taking
its area, mean, and standard deviation equal to those obtained
from the data.*

* For a more complete discussion of such fitting see Chapter XVI.

The work may be illustrated for the distribution of I.Q.'s
given in Table 20 of Chapter VII. The necessary constants,
already worked out, are

N = 4834,
σ = 1.686 × 10 (1.661 with Sheppard's correction*),
M = 89.28.

Using formula (82), the equation of the desired normal curve
will be

$$y = \frac{4834}{\sqrt{2\pi}\,(1.661)}\, e^{-\frac{1}{2}\left(\frac{x}{1.661}\right)^2}.$$

It will be noted that the standard deviation is expressed in units
of class intervals, which is necessary in order to make $y_0$
comparable with the observed frequency in the interval at the mean,
and bring the total area and frequency equal to N.

Table 45. Normal Ordinates for I.Q. Data

   x/σ     Scalar Abscissas        z      y = (N/σ) × z
   0.0     89.28                 .399       1161
  ±0.5     97.58 and 80.98       .352       1024
  ±1.0    105.89 and 72.67       .242        704
  ±1.5    114.19 and 64.37       .130        378
  ±2.0    122.50 and 56.06       .054        157
  ±2.5    130.80 and 47.76       .018         52
  ±3.0    139.11 and 39.45       .004         12
  ±3.5    147.41 and 31.15       .001          3
  ±4.0    155.72 and 22.84        -            -

The value for $y_0$, when x = 0, is
$\frac{4834}{2.5066 \times 1.661} = 1161$. From Table 42 it will be
noted that the value for z at x = 0 is $\frac{1}{2.5066} = .399$. It is
therefore necessary to multiply this and all of the other ordinates
taken from this table by the factor $\frac{N}{\sigma} = 2910$. Thus the
ordinates at ±0.5σ, ±1.0σ, etc., will have the values
2910 × .352 = 1024, 2910 × .242 = 704, etc.

The ordinates may be plotted at any convenient distances
from the mean, say at multiples of 0.5σ, which must be worked
out in actual scale units. The value for $y_0$ will, of course, be
taken at 89.28, while the ordinates at 0.5σ will be located at
89.28 ± .5(16.61), or at 97.58 and 80.98, etc. A complete list of
values is shown in Table 45.

* See Chapter XVI, section 8.
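The arithmetic behind Table 45 may be sketched as follows. The constants are those quoted in the text; the class interval of 10 I.Q. points is inferred from the text's use of 16.61 as the standard deviation in scale units, and the function names are illustrative.

```python
import math

N = 4834               # total frequency
SIGMA_CLASSES = 1.661  # standard deviation in class-interval units (Sheppard-corrected)
CLASS_WIDTH = 10       # assumed width of one class interval, in I.Q. points
MEAN_IQ = 89.28

def fitted_ordinate(deviate: float) -> float:
    """Ordinate y = (N / sigma) * z of the fitted curve at a deviate x/sigma."""
    z = math.exp(-deviate * deviate / 2.0) / math.sqrt(2.0 * math.pi)
    return N / SIGMA_CLASSES * z

def scale_positions(deviate: float) -> tuple[float, float]:
    """I.Q. values at +deviate and -deviate standard deviations from the mean."""
    offset = deviate * SIGMA_CLASSES * CLASS_WIDTH
    return MEAN_IQ + offset, MEAN_IQ - offset

for d in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5):
    hi, lo = scale_positions(d)
    print(f"±{d:.1f}  at {hi:6.2f} and {lo:6.2f}   y = {fitted_ordinate(d):7.1f}")
```

Because this sketch uses unrounded ordinates z, some results differ from Table 45 by about a unit; for example, the ordinate at ±0.5σ comes out near 1024.6 rather than 1024.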

A  histogram  of  the  observed  frequencies  and  the  fitted  normal 
curve  have  been  plotted  on  the  same  background  in  Fig.  55. 


Fig. 55. Histogram for 4834 intelligence quotients with fitted normal curve
(horizontal axis: I.Q., 30 to 160; vertical axis: frequency, up to 1200)


The agreement, as judged by mere inspection, appears to be
rather good, but this method of comparison is worth very little
in determining whether or not a particular mathematical curve
adequately describes a body of data. The accurate method is
to compare the discrepancies in frequency (area) between the
histogram and the theoretical curve and determine whether or
not the differences may be accounted for by chance
fluctuations of sampling. This test for goodness of fit will be applied
in the chapter on Sampling (section 7).
6. Some Properties of the Normal Curve

From the fact that the normal curve is a continuous function
it is now possible to find the probability for an occurrence
between any two limits, $x_1$ and $x_2$. The actual frequency between
these limits gives the number of favorable ways the event may
happen, while the total frequency gives the total number of
possible ways. The quotient of these two frequencies, or

    (Frequency of occurrence between $x_1$ and $x_2$) / (Total frequency of all occurrences),

then furnishes the desired measure of the probability.

In case the unit-area form of the normal curve is used, the
denominator of this fraction becomes 1, and the probability for
an occurrence between $x_1$ and $x_2$ is merely the area between
these limits.

This area, which is known as the probability integral, may
be found by using the appropriate values of ½α given in
Table 42 or in more extended tables such as Pearson's.

To illustrate the use of Table 42 in this connection, let us
find the probability for an occurrence between 1σ and 2σ.
This is represented in Fig. 56 by the shaded area. From the
table the area from x = 0 to x = 2 is found to be .477, while
the area from x = 0 to x = 1 is .341. The required area and
probability is therefore the difference between these two values,
or .136.
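The same subtraction of areas can be carried out with a cumulative normal function in place of Table 42. A minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the helper name `probability_between` is hypothetical.

```python
from statistics import NormalDist

UNIT = NormalDist()  # unit normal curve: mean 0, standard deviation 1

def probability_between(x1: float, x2: float) -> float:
    """Area under the unit normal curve between deviates x1 and x2 (x1 < x2)."""
    return UNIT.cdf(x2) - UNIT.cdf(x1)

print(f"between 1 and 2 sigma: {probability_between(1.0, 2.0):.4f}")
print(f"within ±3 sigma:       {probability_between(-3.0, 3.0):.7f}")
```

The first value is about .1359, agreeing with the .136 obtained from the three-place table.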

The same reasoning may be applied in the case of a
distribution of observed data such as the 4834 I.Q.'s. In order to find
the probability of getting an I.Q. between 130 and 140 in such
a group it is only necessary to divide 36 (the number of
favorable occurrences) by 4834 (the number of equally likely
occurrences), and obtain .0074 as the required probability. Thus, if
the 4834 I.Q.'s were recorded on little tickets and mixed up in
a box, the chance of drawing a card with I.Q. between 130 and
140 would be .0074, or less than one in a hundred.
The probability integral is also useful in determining the
chances that an occurrence will lie within or without a given
middle range about the mean. Thus the probability (from
Table 42) for an event between -3σ and +3σ is the value of
α at x = 3, that is, 2 × .499, or .998, while the probability for
an occurrence beyond these limits in either direction is .002.
By more extended tables,* these two values are .9973002 and
.0026998, respectively. The probability for an occurrence
beyond ±6σ is .000000002, or only twice in a billion trials.

Fig. 56. Illustrating the area between x = 1 and x = 2 on a normal curve

In case the probable error is used as a unit of measurement it
is possible to determine the probabilities for an occurrence
between the given multiples of P.E. when laid off on either side of
the mean. Thus the chance of a deviate within ±1 P.E. is ½
(by definition). A short table of such probabilities is given
in Table 46.
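Probabilities of this kind follow from formula (84) together with the area function. The sketch below is one way to compute them, assuming Python's `statistics.NormalDist`; the name `probability_within` is illustrative.

```python
from statistics import NormalDist

UNIT = NormalDist()
PE = UNIT.inv_cdf(0.75)  # probable error of the unit normal curve, about .6745

def probability_within(multiple: float) -> float:
    """Chance that a deviate lies within ±(multiple × P.E.) of the mean."""
    x = multiple * PE
    return 2.0 * UNIT.cdf(x) - 1.0

for m in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5):
    print(f"±{m:.1f} P.E.   probability = {probability_within(m):.3f}")
```

The results agree with the three-place values of Table 46 to within about one unit in the last place.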

Another interesting property of the normal curve makes it
possible to find the mean of the portion between any two
ordinates. Let the equation of the curve be taken in the form

$$z = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \tag{83}$$
(Ordinate of the normal curve, with unit area and standard deviation)

* Pearson, Tables for Statisticians and Biometricians. Cambridge University Press.
Table 46. Probabilities that a Deviate will lie within Certain
Limits on a Normal Curve

  Multiple     Probability of an Occurrence within a
  of P.E.      Range of ± a Given Multiple of P.E.
    .5            .264
   1.0            .500
   1.5            .688
   2.0            .822
   2.5            .908
   3.0            .957
   3.5            .982
   4.0            .993
   4.5            .998

with unit area and standard deviation; let $z_1$ and $z_2$ be the
ordinates at any two points $x_1$ and $x_2$, the second abscissa having
the larger value; let ${}_1\alpha_2$ be the area between these
ordinates; and let ${}_1\bar{x}_2$ denote the mean of the inclosed
portion. Then it may be proved* that

$${}_1\bar{x}_2 = \frac{z_1 - z_2}{{}_1\alpha_2} \tag{85}$$
(Mean of a portion of a normal curve, with unit area and standard deviation)

Fig. 57. Illustrating the mean of a portion of a normal curve between
x = 2 and x = 3

* For any continuous function, z = f(x), the mean between the limits $x_1$
and $x_2$ is given by
$\int_{x_1}^{x_2} xz\,dx \Big/ \int_{x_1}^{x_2} z\,dx$. In the present case,
$z = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$ and
$\int_{x_1}^{x_2} z\,dx = {}_1\alpha_2$. The integral in the numerator may be
readily evaluated, giving $\big[-z\big]_{x_1}^{x_2}$, or $z_1 - z_2$.
Therefore ${}_1\bar{x}_2 = \frac{z_1 - z_2}{{}_1\alpha_2}$.
This theorem may be illustrated by finding the mean of
the piece included between ordinates at x = 2 and x = 3.
From Table 42, $z_1 = .054$ and $z_2 = .004$. The value for
${}_1\alpha_2$ may be found by subtracting ½α for $x_1$ from ½α for
$x_2$, that is to say, .499 - .477 = .022 gives the area ${}_1\alpha_2$
between the two ordinates. The required mean for this piece is
therefore

$${}_1\bar{x}_2 = \frac{.054 - .004}{.022} = +2.3 \quad \text{(Fig. 57)},$$

or 2.3 standard deviations above the mean of the whole curve.
With Pearson's tables we find

$${}_1\bar{x}_2 = \frac{.0539910 - .0044318}{.0214002} = 2.31583.$$

It should be noted that ${}_1\alpha_2$ is always positive, and that the
sign of ${}_1\bar{x}_2$ is determined by the difference between the
ordinates, which must be subtracted in the order indicated. Thus the
mean of the piece between x = -2 and x = 1 will be obtained
by adding the two values for ½α and subtracting the larger
from the smaller ordinate, that is, from Table 42,

$${}_1\bar{x}_2 = \frac{.054 - .242}{.818} = \frac{-.188}{.818} = -.23 \quad \text{(Fig. 58)}.$$

Fig. 58. Illustrating the mean of a portion of a normal curve between
x = -2 and x = +1 (${}_1\bar{x}_2 = -.23$)
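Formula (85) lends itself to a direct check. The following sketch assumes Python's `statistics.NormalDist` for the ordinates and areas of the unit normal curve; the function name `portion_mean` is an illustrative choice.

```python
from statistics import NormalDist

UNIT = NormalDist()  # unit normal curve: mean 0, standard deviation 1

def portion_mean(x1: float, x2: float) -> float:
    """Mean of the piece of the unit normal curve between x1 and x2, formula (85)."""
    if not x1 < x2:
        raise ValueError("x2 must be the larger abscissa")
    area = UNIT.cdf(x2) - UNIT.cdf(x1)          # always positive
    return (UNIT.pdf(x1) - UNIT.pdf(x2)) / area  # (z1 - z2) / area

print(f"between  2 and 3: {portion_mean(2.0, 3.0):+.5f}")
print(f"between -2 and 1: {portion_mean(-2.0, 1.0):+.5f}")
```

The first result reproduces the value obtained from Pearson's tables, and the second agrees with the -.23 found from Table 42.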




7.  Representing  Data  on  a  Normal  Scale 

In case we are dealing with a series of observations with
standard deviation σ, and total frequency N, formula (85) may
be modified so that the inclosed area is a fraction of the total, and
the mean is expressed in units of the standard deviation, that is,

$$\frac{{}_1\bar{x}_2}{\sigma} = \frac{z_1 - z_2}{{}_1\alpha_2 / N} \tag{86}$$
(Mean of a portion of a normal curve, with area = N)

By means of the above formula it is now possible to represent
a qualitative series of observations on a normal scale, assigning
to each class the numerical value given by the mean of each
sub-group. In this way the qualitative series has been converted
into a quantitative one, the assumption being that the law
behind the data is the normal distribution. This method is of
the greatest importance because it makes possible the
application of many formulas requiring numerical values for the classes
(see Chapter XIV).

Any other curve might be used to represent such data, but as
indicated at the beginning of this chapter the normal curve is
the best single approximation to most educational data, and
very fortunately it is extremely simple to apply.

As an example, let us represent the following qualitative
series on a linear and then on a normal scale. The data are general
health estimates of school children made by several physicians.

Table 47. Health Data with Percentage Frequencies

  Health of Child       f      Percentage f
  Very robust           16        2.0
  Robust               199       24.4
  Normal               345       42.3
  Rather delicate      115       14.1
  Delicate             124       15.2
  Very delicate         16        2.0

  Total                815      100.0
If we assume that the attribute, health, is distributed with
equal frequency along a scale, the resulting series will form a
long rectangle. The mean of each piece, occurring at the middle,
might then be taken as a numerical measure of the class. This
method, however, would be unsound because it assumes a form
of distribution totally unlike any observed for such traits.

Fig. 59. Rectangular distribution of the health series (classes in the order
V.D., D., R.D., N., R., V.R.; midpoints marked at 78, 197.5, 427.5, 699.5,
and 807 on a scale running to 815)

Assuming that health is normally distributed, the series may
be represented as in Fig. 60. It is now possible to determine
the means of the various pieces by the use of Table 43. The
need of such a table becomes apparent when it is noted that
areas and not deviates are furnished by the data. While it is
better to use more extended tables, such as Kelley's* or
Holzinger's, the work will be illustrated by Tables 48 and 49, the
figures in parentheses being obtained from Holzinger's Table XII.
If the ordinates are designated as $z_1, z_2, z_3, \cdots, z_7$ it is
evident that $z_1$ and $z_7$ are zero. The other five ordinates,
inclosing various pieces, may be obtained by reducing the areas to
total unit area, and entering Table 43 with the proper value of ½α.
Thus the area to the right of $z_6$ is .020, so that ½α = .480; the
area to the right of $z_5$ is .264, giving ½α = .236; while the area
to the right of $z_4$ is .687, for which ½α = .187, as shown in
Table 48.

Fig. 60. Representation of the health data on a normal scale (areas from
V.D. to V.R.: .020, .152, .141, .423, .244, .020)

* T. L. Kelley, Statistical Method. The Macmillan Company.
Table 48. Showing the Calculation of the Five Ordinates for the
Health Data Represented on a Normal Scale

  Ordinate   Area between Ordinates          ½α            Value of
             (this ordinate and the next)                  Ordinate
    z7            .020                     .50 (.500)     .000
    z6            .244                     .48 (.480)     .048 (.0484)
    z5            .423                     .24 (.236)     .324 (.3269)
    z4            .141                     .19 (.187)     .353 (.3543)
    z3            .152                     .33 (.328)     .253 (.2550)
    z2            .020                     .48 (.480)     .048 (.0484)
    z1                                     .50 (.500)     .000

The  means  may  now  be  obtained  by  subtracting  the  proper 
ordinates  and  dividing  by  the  area  between  them.  The  work 
may  then  be  set  down  as  follows : 

Table 49. Showing the Calculation of the Means of the
Health Categories

  Mean       Value from 3-Place Table           Value from 4-Place Table
  ₆x̄₇/σ     (.048 - .000)/.020 = +2.40         (.0484 - .0000)/.020 = +2.42
  ₅x̄₆/σ     (.324 - .048)/.244 = +1.13         (.3269 - .0484)/.244 = +1.14
  ₄x̄₅/σ     (.353 - .324)/.423 = +0.07         (.3543 - .3269)/.423 = +0.06
  ₃x̄₄/σ     (.253 - .353)/.141 = -0.71         (.2550 - .3543)/.141 = -0.70
  ₂x̄₃/σ     (.048 - .253)/.152 = -1.35         (.0484 - .2550)/.152 = -1.36
  ₁x̄₂/σ     (.000 - .048)/.020 = -2.40         (.0000 - .0484)/.020 = -2.42


As a check on the computation, the products of the means by
the corresponding areas, when added, should equal zero (the
mean of the whole distribution), for example,

2.40 × .020 + 1.13 × .244 + 0.07 × .423
- 0.71 × .141 - 1.35 × .152 - 2.40 × .020 = +.00002.

The check in this case is accidentally close.
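The whole calculation of Tables 48 and 49, including the check, can be sketched in code. The version below takes the frequencies of Table 47 and assumes Python's `statistics.NormalDist` in place of the printed tables; since it works from unrounded areas (for example 16/815 rather than .020), its means agree with the four-place column only approximately. The names `HEALTH` and `category_means` are illustrative.

```python
from statistics import NormalDist

UNIT = NormalDist()  # unit normal curve: mean 0, standard deviation 1

# Frequencies from Table 47, ordered from the low end of the scale upward.
HEALTH = [
    ("Very delicate", 16),
    ("Delicate", 124),
    ("Rather delicate", 115),
    ("Normal", 345),
    ("Robust", 199),
    ("Very robust", 16),
]

def category_means(classes):
    """Return (name, area, mean in sigma units) for each ordered class."""
    total = sum(freq for _, freq in classes)
    results = []
    cumulative = 0
    lower_z = 0.0  # ordinate at the extreme left of the curve
    for name, freq in classes:
        cumulative += freq
        if cumulative < total:
            upper_z = UNIT.pdf(UNIT.inv_cdf(cumulative / total))
        else:
            upper_z = 0.0  # ordinate at the extreme right of the curve
        area = freq / total
        results.append((name, area, (lower_z - upper_z) / area))
        lower_z = upper_z
    return results

results = category_means(HEALTH)
for name, area, mean in results:
    print(f"{name:16s} area = {area:.3f}   mean = {mean:+.2f}")
print("check (should be zero):", round(sum(a * m for _, a, m in results), 10))
```

The final line repeats the check described above: the means weighted by their areas sum to zero, apart from floating-point rounding.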

According to these results the "health" difference between a
typical robust and a typical normal child is 1.06σ, while the
difference between a normal and a rather delicate child is 0.78σ.
It will also be noted that the mean of the normal health group
is very close to zero (the mean of the whole distribution) and
that the very delicate and very robust groups are equally
divergent from this point.

While comparisons such as these are often of great value in
analyzing a body of qualitative data, the chief use of this scaling
method is in studying the relationship between several traits.
It is possible, for example, to obtain a measure of the
relationship (correlation) between health and general nutrition, or
between health and intelligence, by representing the pairs of
characters on normal scales (see Chapter XIV).

8.  The  Scaling  of  Test  Questions 

The normal curve has been widely used in the scaling of
standardized test questions. Assuming that the ability of the pupils
is measured by the difficulty of the exercises, the latter may be
represented on a normal scale. If nearly all of a group of pupils
solve a problem, its value will be low; if 50 per cent do an
exercise correctly, its value will be at the mean; while if very
few succeed on an item, it will be located high on the normal
scale. The particular scale value of the item is thus determined
by the per cent of the group solving the problem correctly.

In Fig. 61 the percentage of correct solutions is shown by the
shaded area, and the value of the item is given by the
corresponding abscissa. In order to obtain this value for this
example it is only necessary to enter Table 43 with ½α = .20,
giving x = .524. The problem thus has a difficulty or ability
value of .524 standard deviation above the mean.
By taking the mean at zero it will be noted that negative
values of the deviates will occur. This may be overcome by
shifting the origin to some convenient point, say 5σ below the
mean, as shown in Fig. 61. Such an arbitrary origin should not be
confused with the point for "just no ability in the trait" sought
after by some test makers. Just as temperature is measured on
the Fahrenheit scale from an arbitrary zero, not representing the
point for no heat, so educational scales may be taken from any
convenient reference point, not representing "just no ability."
It is possible to scale the items one at a time, or several at once,
as proposed by McCall.* The procedure by the first method
may be further illustrated with some reading questions given
to a large group of twelve-year-old pupils. In Table 50 the first
and sixth questions will have negative deviates given by
entering Table 43 with ½α = (.98 - .50) and (.75 - .50), while the
other two questions will have positive deviates, being at the
right of the mean. The final scaled values are obtained by
merely adding 5 to each of these deviates.

Fig. 61. Illustrating the scaling of test questions with the normal curve
(difficulty or ability scale: the item lies at .524 in σ units from the mean,
or at 5.524 measured from the origin at -5σ, on a scale running from 0 to 10)

* McCall, How to Measure in Education. The Macmillan Company.

Table 50. Showing a Method for Scaling Each Test Item

Problem    Per Cent of Pupils      ½α       x/σ       Scaled Value
           Answering Correctly                        = x/σ + 5
   1              98              .48     −2.054        2.946
   6              75              .25     −0.674        4.326
  13              46              .04     +0.100        5.100
  24               4              .46     +1.751        6.751
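The arithmetic of Table 50 may be checked by machine. The following is only a minimal sketch in Python, not part of the original method: the function name `item_scaled_value` is hypothetical, and the standard-library `statistics.NormalDist` is assumed to stand in for Table 43. Results may differ from the printed table in the last decimal place.

```python
from statistics import NormalDist

def item_scaled_value(percent_correct, origin_shift=5.0):
    """Return (deviate, scaled value) for an item solved by percent_correct
    per cent of the group. Hypothetical helper; percent_correct must lie
    strictly between 0 and 100."""
    # The pupils solving the item form the upper tail of the curve, so the
    # deviate is the normal quantile of the fraction failing the item.
    fraction_failing = 1.0 - percent_correct / 100.0
    deviate = NormalDist().inv_cdf(fraction_failing)
    # Shifting the origin to 5 sigma below the mean removes negative values.
    return deviate, deviate + origin_shift

# The four reading questions of Table 50: (problem number, per cent correct).
for problem, pct in [(1, 98), (6, 75), (13, 46), (24, 4)]:
    x, scaled = item_scaled_value(pct)
    print(problem, pct, round(x, 3), round(scaled, 3))
```

An easy item (98 per cent correct) falls well below the mean, and a hard item (4 per cent correct) well above it, in agreement with the signs shown in the table.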

By McCall's method it is necessary to note the percentage of successful replies to at least 0, 1, 2, 3, . . . questions, the items being previously arranged in rough order of difficulty. Thus, with the above reading material, the following results were obtained:

Table 51. Showing McCall's Method of Scaling Test Questions

Number of       Number of Pupils    Percentage of Pupils       ½α        x       Scaled
Questions       Obtaining           Exceeding, plus Half                         Value
Correct = Q     Given Q             Those at Q
     0               1                    99.9                .499    −3.090     1.910
     1               3                    99.5                .495    −2.576     2.424
     2               5                    98.6                .486    −2.197     2.803
     3               7                    97.3                .473    −1.927     3.073
     4               9                    95.6                .456    −1.706     3.294
   . . .
    21              17                    43.2                .068    +0.171     5.171

Total . . . .      462

In  order  to  obtain  the  percentage  of  pupils  above  a  given 
class  value,  McCall  has  added  one  half  of  the  number  of  pupils 
at  Q  to  the  number  exceeding  Q,  and  then  divided  by  the  total 
number  in  the  sample.  The  arithmetic  for  the  first  two  values 
in  the  above  table  will  then  be 


$$\frac{461 + \tfrac{1}{2} \times 1}{462} = .999, \qquad \frac{458 + \tfrac{1}{2} \times 3}{462} = .995.$$

The deviates and scaled values are obtained by Holzinger's Table XII and by adding 5 to eliminate the negative signs. McCall, however, multiplies these last values by 10 and calls them T scores.
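The steps of McCall's method for the first five rows of Table 51 may be sketched as follows. This is an illustrative Python sketch under stated assumptions, not McCall's own procedure in code: the name `mccall_rows` is hypothetical, `statistics.NormalDist` is assumed to replace the printed table of deviates, and the proportion is rounded to three places before the deviate is found, as it would be when entering a table.

```python
from statistics import NormalDist

def mccall_rows(counts, total, origin_shift=5.0):
    """counts[q] is the number of pupils with exactly q questions correct,
    for q = 0, 1, 2, ...; total is the size of the whole sample, which may
    exceed sum(counts) when only the first few classes are supplied.
    Hypothetical helper for illustration."""
    rows = []
    at_or_below = 0
    for q, n_at in enumerate(counts):
        at_or_below += n_at
        exceeding = total - at_or_below
        # Proportion exceeding Q plus half those at Q, rounded to three
        # places (an assumption made to mimic the use of a printed table).
        p = round((exceeding + 0.5 * n_at) / total, 3)
        # The deviate leaves the proportion p above it.
        x = NormalDist().inv_cdf(1.0 - p)
        scaled = x + origin_shift
        rows.append((q, p, x, scaled, 10.0 * scaled))  # last entry: T score
    return rows

for q, p, x, scaled, t in mccall_rows([1, 3, 5, 7, 9], total=462):
    print(q, p, round(x, 3), round(scaled, 3), round(t, 1))
```

Passing the full list of class frequencies would extend the same calculation to the remaining rows of the table.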

According to the first method of scaling, the score of a pupil answering the first four questions correctly would be the sum of the four scaled values. By McCall's method, such a performance would be scaled by assigning the T score corresponding to Q = 4 from Table 51. McCall's method is, therefore, very convenient, but there is some doubt as to the assumption that different sequences of problems (for example, 1, 2, 3, 4, 5, . . ., 1, 2, 4, 5, 6, etc.) obtained by various pupils have the same value.

It should be noted that great precision in scaling test material is idle. The figures above have been put down as they came from the tables, but they should ordinarily be rounded off to one decimal place at most.

Scaled values are often an unnecessary refinement in measuring large groups as evidenced by the high correlations between scaled and unscaled items. Professor Douglass,* for example, found a correlation of about .98 between weighted and unweighted algebra scores, a result which is much higher than the reliability of the tests themselves. He concluded that the unscaled values give the relative standing of the pupil with sufficient accuracy for ordinary testing uses.

In  the  case  of  individual  measurements,  scaled  values  also 
lose  much  of  their  significance  because  they  are  based  upon  a 
large  group  and  may  not  apply  to  a  single  person.  Thus  for 
the  whole  group,  problem  1  in  Table  50  has  the  value  2.9, 
while  problem  6  has  the  value  4.3.  For  a  given  individual, 
however,  it  is  not  improbable  that  the  two  items  are  equally 
difficult. 

* H. R. Douglass and P. L. Spencer, "Is it Necessary to weight Exercises in Standard Tests?", Journal of Educational Psychology, February, 1923, p. 109. Dr. Scates and the writer have also found correlations of .994, .995, .997, and .998 between weighted and unweighted scores, the number of items weighted varying from six to ten, and the weights being quite different.

The chief advantage in scaling by the above methods is that test results are thereby expressed in comparable units from comparable reference points (for example, a T of 60 on any test means 1σ above the mean). Weighted values may also be used to graduate test material in order of difficulty or to arrange parallel groups of items such as spelling words of equal difficulty.

When test material is to be scaled by the judgment of experts rather than by the performance of the pupils, the normal curve may again be employed. The procedure here is to have the judges arrange the pupil specimens (say drawings) in order of merit according to their best opinion. If 50 per cent of the judges rate specimen A as better than specimen B, these two are regarded as of equal value on the assumption that "equally rated differences are equal unless they are always or never noticed."*

If 85 per cent of the judges rate specimen C as better than A, then the difference in value between A and C is obtained by finding the deviate for ½α = (.85 − .50) = .35. Thus in Fig. 62, C has the scaled value x = 1.04, the unit being the standard deviation with the origin at the mean. If the percentage of judges rating C better than B is 83, then a new scaled value x = .95 may be averaged with 1.04, etc.

By calculating similar differences for all pairs of specimens a series of scaled values is obtained. The origin may be taken at an arbitrary point (such as −5σ), but is often selected at the specimen which most judges consider worthless.

* This is known as the Cattell-Fullerton Theorem.

[Fig. 62. Illustrating the scaling of items for a product test. Abscissas marked at 1.04 and 1.48.]



As a further illustration of the arithmetic, two items may be added to the above series. Problem D is rated better than A by 93 per cent of the judges, while problem E is regarded by most as worthless. Assuming that 99 per cent of the judges rate A better than E, the value of the latter becomes −2.33. The scaled values may now be written as follows:


                    Values of Specimens
Origin         E        A, B        C         D
  A          −2.33       0        1.04      1.48
 −5σ          2.67       5        6.04      6.48
  E           0          2.33     3.37      3.81

The  last  row  of  numbers  is  probably  most  convenient  to  use, 
but  zero  is  then  only  a  rough  approximation  to  ''  just  no  ability." 

All these results were obtained by using A as the item of comparison, but approximately the same values would have been secured if all differences had been computed with reference to problem E. In the final scale it is usually best to select only those items which differ from one another by fairly large amounts (say .5σ), because, in using the scale, finer differences cannot be readily noted.
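The scaling of judged specimens and the shifting of the origin may be sketched in a few lines. The Python below is an illustration only: the helper names `judged_difference` and `shift_origin` are hypothetical, and `statistics.NormalDist` is assumed in place of the printed table. Because the text rounds each deviate to two places before adding, the values measured from E may differ by .01 from those printed above.

```python
from statistics import NormalDist

def judged_difference(percent_preferring):
    """Deviate separating two specimens when percent_preferring per cent of
    the judges rate the first as better (hypothetical helper)."""
    return NormalDist().inv_cdf(percent_preferring / 100.0)

# Values measured from specimen A (B being rated equal to A).
values_from_a = {
    "E": -judged_difference(99),   # 99 per cent rate A better than E
    "A,B": 0.0,
    "C": judged_difference(85),    # 85 per cent rate C better than A
    "D": judged_difference(93),    # 93 per cent rate D better than A
}

def shift_origin(values, new_origin):
    """Re-express every value as a distance from new_origin."""
    return {name: v - new_origin for name, v in values.items()}

from_minus_5_sigma = shift_origin(values_from_a, -5.0)
from_e = shift_origin(values_from_a, values_from_a["E"])

for name in values_from_a:
    print(name,
          round(values_from_a[name], 2),
          round(from_minus_5_sigma[name], 2),
          round(from_e[name], 2))
```

The three printed columns correspond to the three rows of the table of specimen values: origin at A, origin at −5σ, and origin at E.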

EXERCISES 

1. Find the probabilities of occurrences within the following ranges for a normal curve. Use Holzinger's Table XI.

Range                     Probability (Ans.)
−2.5σ  to −1.5σ                .0606
−2.5σ  to +2.5σ                .9876
+1.0σ  to +3.0σ                .1574
+3.54σ to +3.88σ               .0001
−0.62σ to +2.79σ               .7298

2. Allowing a range of 1.2σ for each of the five marks A, B, C, D, and E, find the percentages of such marks under a normal distribution. (3.46, 23.84, 45.14, 23.84, 3.46. Ans.)

3. Represent the following data on a normal scale and find the means of the five categories:

Grade of School Work     Percentage Frequency     Mean (Ans.)
        A                         5                +2.062σ
        B                        21                +1.054σ
        C                        49                −0.013σ
        D                        18                −1.019σ
        E                         7                −1.919σ

4.  Eighty-eight  per  cent  of  a  group  of  judges  rate  drawing  A  as 
better  than  drawing  B,  while  75  per  cent  rate  B  as  better  than  C. 
Assuming  that  99  per  cent  of  the  judges  have  rated  C  as  better  than 
X,  which  has  no  merit  whatsoever,  obtain  the  values  of  the  drawings 
C,  B,  and  A  with  respect  to  X. 

(X = 0; C = 2.3263σ; B = 3.0008σ; A = 4.1758σ. Ans.)

5. In a large group of children, the percentage of those who solved a given example, with five specified examples considered one at a time, varied as follows: 94, 87, 61, 43, 11. Find the σ value of each example, using as origin a point 5σ below the mean.

(3.4452, 3.8736, 4.7207, 5.1764, 6.2265. Ans.)

6. Find the percentage distribution of five marks, using a range of 1σ for each. (6.06, 24.17, 38.30, 24.17, 6.06. Ans.)

7. Verify the following results:

Number of       Percentage of        Per Cent Exceeding,       T Score (Ans.)
Questions       Pupils Obtaining     plus Half Those
Correct = Q     Given Q              Reaching Q
     0                2                     99                     26.7
     1                6                     95                     33.6
     2               12                     86                     39.2
     3               18                     71                     44.5
     4               20                     52                     49.5
     5               14                     35                     53.9
     6               12                     22                     57.7
     7               10                     11                     62.3
     8                6                      3                     68.8

8.  Calculate  the  ordinates  for  (.5  +  .5)^  and  compare  them  with 
those  of  the  normal  curve. 

9.  Fit  normal  curves  to  the  distributions  of  I.Q.'s  given  in  Table  55 
of  Chapter  XIII.    Use  columns  1,  2,  3,  and  4. 


CHAPTER  XIII 

SAMPLING  AND  RESPONSE  ERRORS 

1.  Introductory 

All  statistical  quantities  such  as  averages  and  measures  of 
relationship  are  based  upon  samples.  The  results  found  from 
one  sample  will  never  quite  agree  with  those  found  from  another, 
nor  with  those  from  the  whole  population  from  which  the  samples 
were  chosen.  In  determining  the  stability  of  a  given  measure  or 
in  comparing  the  results  from  different  groups  it  is  therefore 
important  to  know  the  probable  extent  of  such  fluctuations. 

Thus a correlation of .30 may appear to indicate some relationship between two traits, but if on taking another sample the coefficient is found to be .10, we can place little confidence in either of the two results. Some measure of the likely variation from sample to sample is clearly desirable.

Again,  in  the  case  of  a  control  experiment,  two  means  might 
be  obtained  for  comparison,  their  difference  being  the  test  of  the 
relative  superiority  of  two  methods  of  learning.  For  example, 
the  mean  gain  might  be  22  for  a  control  group,  and  20  for  a 
practice  group.  The  difference  is  2,  but  whether  or  not  it  is 
of  any  significance  remains  to  be  shown.  It  might  be  that  by 
repeating  the  experiment  the  difference  would  come  out  to  be 
—  3  in  favor  of  the  other  group.  Here  also  a  critical  test  of  such 
differences  under  sampling  is  necessary. 

The stability of a statistical constant from sample to sample is often called its reliability* and is measured by the use of sampling formulas to be discussed in the present chapter. On account of the rather elaborate mathematics involved only a few of the proofs of these formulas will be given, but their use and interpretation as applied to a variety of educational problems will be treated at some length.

* This term should not be confused with the reliability coefficient ni for a test. It might therefore be better to use the expression "sampling reliability" for the former.

Sampling formulas as applied to statistical data are usually approximations, their accuracy depending on certain assumptions in the proofs and especially upon the number of cases involved. The chief danger in using such formulas without being familiar with the proofs may be avoided by never applying them to a small number of cases (say less than thirty).

In the last section of this chapter some of the current formulas for dealing with response errors will be presented. As noted in Chapter V, response errors are due to the variability of performance within the individual measured or tested.

2.  Sampling  Error  in  the  Mean 

If the true mean of an indefinitely large number of observations be denoted by M and their standard deviation by σ_x, and if the mean of a randomly drawn sample of N individuals be represented by M_1, the difference M − M_1 is known as the sampling error in the mean. It can be shown theoretically that if repeated samples of N be randomly drawn from the population, the differences M − M_1 will be distributed around zero with a standard deviation given by the formula*

$$\sigma_M = \frac{\sigma_x}{\sqrt{N}} \qquad \text{[standard error of the mean]} \tag{87}$$

If the size of the samples is large the distribution of M − M_1 tends to follow a normal curve even though the population sampled is not normal.

* A good proof of this formula is given in Jones's "First Course in Statistics," p. 153. G. Bell & Sons, Ltd., London, 1921. The reasonableness of the formula is at once apparent from the fact that a small dispersion and a large number of cases decrease the size of σ_M.




As an approximation to an indefinitely large number of cases let us assume that we have 50,000 observations of a certain variable with the mean equal to M, and that samples of 500 be drawn. The means of these samples, which may be denoted by M_1, M_2, M_3, . . . M_100, will be distributed about M in a frequency curve resembling that which would have been found had the number of samples been increased indefinitely. A hypothetical distribution of such means is shown in Fig. 63. The mean of all the samples is 148, and the standard deviation, σ_M, is 1.71.

[Fig. 63. Hypothetical distribution of means from one hundred samples. Horizontal axis: sample means from 143 to 154.]

Now let us assume that one of the samples of 500 cases furnishes a mean M_1 equal to 146, and a standard deviation σ_x equal to 37.12. By substituting these values in formula (87) we then find σ_M = 1.66. If the means and standard deviations from other samples had been used in this formula, very nearly the same results would have been obtained for σ_M because σ_x will vary but slightly from sample to sample, provided the size of the sample is large.

It thus appears that in dealing with only one sample the mean of the whole population is unknown, but may be approximated by M_1, and that the formula σ_{M_1} = σ_x/√N gives the best obtainable approximation to the true standard deviation σ_M. The probable error of the mean is given by the formula

$$\mathrm{P.E.}_{M} = .6745\,\frac{\sigma_x}{\sqrt{N}} \qquad \text{[probable error of the mean]} \tag{88}$$

If the true values for M and σ_x were known it would then be possible to find a range on the normal scale within which it is almost certain that an observed mean M_1 must lie. In actual practice, however, it is M_1 and not M that is known, so that this argument must be reversed.

The theoretical curve in Fig. 64 represents the inverse probability for various positions of the true mean when M_1 is known. The value for P.E._{M_1} by formula (88) is .6745 × 1.66 = 1.12. Since half the area of the curve lies between M_1 − P.E. and M_1 + P.E., or between 144.88 and 147.12, the probability that the true mean lies between these limits is .5, and the result is ordinarily written M_1 = 146 ± 1.12. By similar argument we find that the chances are over 99 in 100 that the true mean will lie in the range M_1 ± 4 P.E., or between 141.52 and 150.48 as shown in Table 52. This range is the usually accepted zone of safety.
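Formulas (87) and (88) and the worked example above may be verified with a short calculation. The following Python sketch is offered as an illustration only; the function names are hypothetical and are not taken from the text.

```python
import math

def standard_error_of_mean(sigma_x, n):
    """Formula (87): sigma_M = sigma_x / sqrt(N)."""
    return sigma_x / math.sqrt(n)

def probable_error_of_mean(sigma_x, n):
    """Formula (88): P.E. of the mean = .6745 * sigma_x / sqrt(N)."""
    return 0.6745 * standard_error_of_mean(sigma_x, n)

# The sample described in the text: mean 146, standard deviation 37.12, N = 500.
m1, sigma_x, n = 146.0, 37.12, 500
se = standard_error_of_mean(sigma_x, n)
pe = probable_error_of_mean(sigma_x, n)
low, high = m1 - pe, m1 + pe

print(round(se, 2), round(pe, 2))        # standard error and probable error
print(round(low, 2), round(high, 2))     # range M_1 - P.E. to M_1 + P.E.
```

Since the probable error varies inversely as the square root of N, quadrupling the size of the sample halves the probable error of the mean.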


[Fig. 64. Illustrating various ranges of probable error on a normal curve. Abscissas marked at 141.52, 142.64, 143.76, 144.88, 146, 147.12, 148.24, 149.36, and 150.48.]

Table 52. Probabilities that the True Mean will lie within a Given Range

Range                              Probability that M lies
                                   within Given Range
M_1 ± 1 P.E. (144.88-147.12)             .500
M_1 ± 2 P.E. (143.76-148.24)             .822
M_1 ± 3 P.E. (142.64-149.36)             .957
M_1 ± 4 P.E. (141.52-150.48)             .993
M_1 ± 5 P.E. (140.40-151.60)             .999
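The probabilities of Table 52 follow from the area of the normal curve within a given number of probable errors. A minimal Python sketch is given below, on the assumption that `statistics.NormalDist` may replace the printed table of areas; the function name `probability_within` is hypothetical. The computed figures agree with the table to within about .001.

```python
from statistics import NormalDist

def probability_within(k_probable_errors):
    """Area of a normal curve lying within +/- k probable errors of its
    center, one probable error being .6745 standard deviation."""
    z = 0.6745 * k_probable_errors
    return 2.0 * NormalDist().cdf(z) - 1.0

# The example of the text: M_1 = 146 with a probable error of 1.12.
m1, pe = 146.0, 1.12
for k in range(1, 6):
    low, high = m1 - k * pe, m1 + k * pe
    print(k, round(low, 2), round(high, 2), round(probability_within(k), 3))
```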

The calculation of probable errors of the mean given by formula (88) is facilitated by the use of tables giving the values of χ_1 = .6745/√N. The probable error is obtained by multiplying the observed value of σ_x by the tabled value of χ_1. Thus, for σ_x = 13.1 and N = 147, we find from Holzinger's Tables for Students, Table IX, that χ_1 = .0556. The value for P.E._M is therefore .0556 × 13.1 = .728.


3. The Probable Error of the Difference between Two Means

One  of  the  most  useful  formulas  in  sampling  is  that  for  testing 
whether  or  not  small  differences  may  have  arisen  from  chance. 
The  formula  may  be  employed  with  a  variety  of  statistical 
measures,  but  is  most  frequently  applied  in  the  case  of  the  mean. 

If the variables in two groups, and hence their means, are quite independent of one another the probable error of the difference M_1 − M_2 is given by the formula

$$\mathrm{P.E.}_{M_1 - M_2} = \sqrt{(\mathrm{P.E.}_{M_1})^{2} + (\mathrm{P.E.}_{M_2})^{2}} \qquad \text{[probable error of the difference between two uncorrelated means]} \tag{89}$$

The use of this formula may be illustrated in the case of a control experiment in the teaching of physics. Two groups of pupils were equated with respect to intelligence and initial ability in a type of high-school physics. After teaching one group by the lecture method and the other group by the demonstration method a final test was given and results found as shown in Table 53.

Table 53. Data from Physics-Teaching Experiment

                                             Lecture Group          Demonstration Group
Population                                   N_1 = 37               N_2 = 41
Mean intelligence score                      137                    138
Mean score on initial physics test           74.3                   74.3
Mean score on final physics test             M_1 = 91.43            M_2 = 89.64
Standard deviation for final physics test    σ_1 = 7.08             σ_2 = 7.23
Probable error of M                          P.E._{M_1} = .785      P.E._{M_2} = .761

The probable errors of the means are given by formula (88). Substituting these values in formula (89), we find that

$$\mathrm{P.E.}_{M_1 - M_2} = \sqrt{(.785)^{2} + (.761)^{2}} = 1.09,$$

the arithmetic being quickly done with a table of squares. The difference between final scores may now be written

$$M_1 - M_2 = 91.43 - 89.64 = 1.79 \pm 1.09.$$

Such a difference is regarded as insignificant, or such that it is not unlikely that the true difference is zero. This is illustrated in Fig. 65. Speaking approximately, since the number of observations is small, the probability that the true difference lies outside a range of ±2 P.E., or −.39 to 3.97, is about .18 by Table 54. The probability that the true difference will be outside the range 0 to 3.58 may be had from Table 54 by entering with x/P.E. = 1.79/1.09 = 1.64, the result being approximately .27. The chances are, therefore, approximately one in four that the true difference will be as small as 0 or as large as 3.58.

[Fig. 65. Illustrating the probability that an observed difference will be as low as zero or as high as 3.58. Abscissas marked at −.39, 0, .70, 1.79, 2.88, 3.58, and 3.97.]
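The whole test of the physics experiment may be retraced by machine. The sketch below is a hedged illustration in Python: the function names are hypothetical, `statistics.NormalDist` is assumed to take the place of Table 54, and the threshold of four probable errors is the general rule stated in the text.

```python
import math
from statistics import NormalDist

def pe_mean(sigma, n):
    """Formula (88): probable error of a mean."""
    return 0.6745 * sigma / math.sqrt(n)

def pe_difference(pe_a, pe_b):
    """Formula (89): probable error of the difference between two
    independent measures."""
    return math.sqrt(pe_a ** 2 + pe_b ** 2)

def probability_beyond(ratio):
    """Normal probability of a deviation beyond +/- ratio probable errors
    (both tails), the quantity tabled in Table 54."""
    return 2.0 * (1.0 - NormalDist().cdf(0.6745 * ratio))

# Data of Table 53.
pe1 = pe_mean(7.08, 37)      # lecture group
pe2 = pe_mean(7.23, 41)      # demonstration group
difference = 91.43 - 89.64
pe_diff = pe_difference(pe1, pe2)
ratio = difference / pe_diff

print(round(pe1, 3), round(pe2, 3), round(pe_diff, 2), round(ratio, 2))
print(round(probability_beyond(2.0), 2), round(probability_beyond(ratio), 2))
print("significant by the four-P.E. rule:", ratio >= 4.0)
```

The computed probable errors may differ from those of Table 53 in the third decimal place, since the text works from printed tables.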

In view of the above test the whole study is to be regarded as inconclusive. We have no right to ascribe the observed difference of 1.79 to the superiority of the lecture method when it can be readily accounted for by chance fluctuations in sampling. It should also be noted that there are a large number of variable factors to be controlled in such an experiment. These factors can never be perfectly controlled and will undoubtedly affect the final result to some extent. It is assumed that the errors in sampling are independent of these factors.

Table 54. Probabilities of the Occurrence of Deviations Relative to the Size of the Probable Error

 x/P.E.    Probability of a          x/P.E.    Probability of a
           Deviation beyond                    Deviation beyond
           ± x/P.E.                            ± x/P.E.
  1.0          .5000                  3.0          .0430
  1.1          .4581                  3.1          .0365
  1.2          .4183                  3.2          .0309
  1.3          .3806                  3.3          .0260
  1.4          .3450                  3.4          .0218
  1.5          .3117                  3.5          .0182
  1.6          .2805                  3.6          .0152
  1.7          .2515                  3.7          .0126
  1.8          .2247                  3.8          .0104
  1.9          .2000                  3.9          .0085
  2.0          .1773                  4.0          .0070
  2.1          .1567                  4.1          .0057
  2.2          .1378                  4.2          .0046
  2.3          .1208                  4.3          .0037
  2.4          .1055                  4.4          .0030
  2.5          .0918                  4.5          .0024
  2.6          .0795                  4.6          .0019
  2.7          .0686                  4.7          .0015
  2.8          .0589                  4.8          .0012
  2.9          .0505                  4.9          .0009

The  general  rule,  already  noted,  is  that  a  difference  or  a 
statistical  constant  of  any  sort  is  not  significant  unless  it  is 
at  least  four  times  its  probable  error. 

Table 54 gives the probabilities for deviations greater than +x/P.E. and less than −x/P.E. for various values of x/P.E.; that is, the fraction of the area under a normal curve beyond these limits.


4. The Probable Errors of Certain Constants for a Normal Distribution

The probable error of the mean may be used for any form of distribution, but in the case of certain other constants, it is assumed in the proofs that the distribution is normal. The following formulas should therefore be used only in case the observed distribution from which the constants are obtained approximates the normal probability curve.

The probable error of the median is given by the formula

$$\mathrm{P.E.}_{Md} = \frac{.84535\,\sigma_x}{\sqrt{N}} = 1.2533\,\mathrm{P.E.}_{M} \qquad \text{[probable error of the median]} \tag{90}$$

Inasmuch as the sampling error in the median is about 25 per cent more than in the mean, the greater reliability of the latter is at once apparent. For certain very peaked (leptokurtic) distributions the median may be more reliable,* but for the large majority of problems the distributions are roughly normal and the mean is to be preferred.

The standard deviation is one of the most reliable of all statistical constants, its probable error being given by

$$\mathrm{P.E.}_{\sigma} = \frac{.6745\,\sigma}{\sqrt{2N}} = \frac{.4769\,\sigma}{\sqrt{N}} = .7071\,\mathrm{P.E.}_{M} \qquad \text{[probable error of the standard deviation]} \tag{91}$$

In case P.E._M is also required, the last form on the right is probably the most convenient for computation.
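The relations among formulas (88), (90), and (91) may be seen by computing all three for one sample. The Python below is a sketch only, with hypothetical function names; the standard deviation and number of cases are borrowed from the lecture group of Table 53 merely as sample input, and a normal distribution is assumed as the text requires.

```python
import math

def pe_mean(sigma, n):
    """Formula (88): probable error of the mean."""
    return 0.6745 * sigma / math.sqrt(n)

def pe_median(sigma, n):
    """Formula (90): probable error of the median (normal distribution assumed)."""
    return 0.84535 * sigma / math.sqrt(n)

def pe_sigma(sigma, n):
    """Formula (91): probable error of the standard deviation
    (normal distribution assumed)."""
    return 0.6745 * sigma / math.sqrt(2 * n)

sigma, n = 7.08, 37   # sample input only, taken from Table 53
print(round(pe_mean(sigma, n), 3),
      round(pe_median(sigma, n), 3),
      round(pe_sigma(sigma, n), 3))

# The ratios do not depend on sigma or N.
print(round(pe_median(sigma, n) / pe_mean(sigma, n), 4))
print(round(pe_sigma(sigma, n) / pe_mean(sigma, n), 4))
```

The two ratios printed last are the multipliers 1.2533 and .7071 of formulas (90) and (91).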

The coefficient of variation, V, has for its probable error

$$\mathrm{P.E.}_{V} = \frac{.6745\,V}{\sqrt{2N}}\left[1 + 2\left(\frac{V}{100}\right)^{2}\right]^{1/2} \qquad \text{[probable error of the coefficient of variation]} \tag{92}$$

The calculation is facilitated by the use of Pearson's Tables V and VI, which give the values of χ_2 and ψ. The formula may then be written

$$\mathrm{P.E.}_{V} = \left[\frac{.6745}{\sqrt{2N}}\right]\left[V\left\{1 + 2\left(\frac{V}{100}\right)^{2}\right\}^{1/2}\right] = \chi_2\,\psi \qquad \text{[probable error of } V \text{ with Pearson's Tables]} \tag{93}$$

The probable error of the correlation coefficient is

$$\mathrm{P.E.}_{r} = \frac{.6745\,(1 - r^{2})}{\sqrt{N}} \qquad \text{[probable error† of the correlation coefficient]} \tag{94}$$

* Yule, Introduction to Statistics, p. 338.

† The student is warned that this formula should not be applied when N is small and at the same time r is large. Misleading results may follow for such cases, as N = 20 and r = .5, N = 50 and r = .8, or N = 100 and r = .9.

Complete tables* for this error have been worked out by the writer for every value of N from 20 to 100, and by tens thereafter up to 1000. A shorter table is also found in Table X of Holzinger's Statistical Tables for Students.

An approximate value for the probable error of the correlation ratio is

$$\mathrm{P.E.}_{\eta} = \frac{.6745\,(1 - \eta^{2})}{\sqrt{N}} \qquad \text{[probable error of the correlation ratio]} \tag{95}$$

so that the above tables may also be used for this measure of association.

In the case of the regression coefficients b_xy = r σ_x/σ_y and b_yx = r σ_y/σ_x, the probable errors are

$$\mathrm{P.E.}_{b_{xy}} = .6745\,\frac{\sigma_x}{\sigma_y}\,\frac{\sqrt{1 - r^{2}}}{\sqrt{N}} \quad \text{and} \quad \mathrm{P.E.}_{b_{yx}} = .6745\,\frac{\sigma_y}{\sigma_x}\,\frac{\sqrt{1 - r^{2}}}{\sqrt{N}} \qquad \text{[probable errors of regression coefficients]} \tag{96}$$

Similar formulas are applied in the case of partial regression coefficients (see Chapter XV); that is,

$$\mathrm{P.E.}_{b_{12.k}} = .6745\,\frac{\sigma_{1.2k}}{\sigma_{2.k}\sqrt{N}} \qquad \text{[probable error of higher-order regression coefficient]} \tag{97}$$

k being any collection of secondary subscripts other than 1 or 2. These last formulas should not be confused with formulas (45) and (46), which give the probable errors of estimate of a single score by the lines of regression.

In testing for linearity of regression, the probable error of ζ = η² − r² has already been used. The formula is

$$\mathrm{P.E.}_{\zeta} = \frac{2(.6745)}{\sqrt{N}}\sqrt{(\eta^{2} - r^{2})\{(1 - \eta^{2})^{2} - (1 - r^{2})^{2} + 1\}} \qquad \text{[probable error of } \eta^{2} - r^{2}\text{]} \tag{98}$$

If η² − r² is to be less than three times its probable error, the above expression reduces to formula (67) of Chapter X.

* Karl J. Holzinger, Tables of the Probable Error of the Correlation Coefficient, Tracts for Computers No. XII, p. 35. Cambridge University Press, England, 1925.

5. Some Applications of Probable Error Formulas

One  important  use  of  the  sampling  theory  is  to  determine 
whether  or  not  two  or  more  samples  belong  to  the  same  or  to 
different  types  of  populations.  This  may  be  illustrated  in  the 
case  of  the  distribution  of  4834  intelligence  quotients  given  in 
Table  20.  The  total  distribution  may  be  broken  up  into  the 
sub-groups  given  in  Table  55. 

From the means and standard deviations at the bottom of the table, we may now test the difference between various groups designated from 1 to 6. If A and B are any two independent measures, formula (89) becomes

$$\mathrm{P.E.}_{A-B} = \sqrt{(\mathrm{P.E.}_{A})^{2} + (\mathrm{P.E.}_{B})^{2}}.$$

Using this formula together with (88) and (91) we find:

$$M_1 - M_2 = -5.52 \pm \sqrt{(.27)^{2} + (.34)^{2}} = -5.52 \pm .43$$

and

$$\sigma_1 - \sigma_2 = 5.98 \pm \sqrt{(.19)^{2} + (.24)^{2}} = 5.98 \pm .31,$$

both differences being clearly significant. The grade and high school city children are thus to be regarded as distinctly different intellectual types, the differences being probably due to selection.

By similar calculations we obtain:

M_1 − M_3 = 9.34 ± .37,      σ_1 − σ_3 = 1.07 ± .26,

M_2 − M_3 = 14.86 ± .42,     σ_2 − σ_3 = 4.91 ± .31.

Since all these differences are significant, the three white groups are to be considered as samples from essentially different types of populations.

The means for the two negro groups are found to be significantly lower than those for any of the white groups. The difference M_4 − M_5 (2.64 ± .78) does not prove to be significant by the usual test. From Table 54, however, it will be found that the odds are about 45 to 1 that city and country negroes are to be regarded as distinct intellectual types.

It is evident from the above comparisons that all five groups making up the total are to be regarded as samples from quite distinct population types. This lack of homogeneity doubtless accounts in part for the fact that group 6 does not furnish a good example of a normal curve.

Another  application  of  the  formula 

P,E.A-B  =  ^{P.E.Ay-\-{P.E.By 

may be made in the comparison of correlation coefficients. In the same number of the Journal of Educational Psychology two writers * presented correlations between mental ages on the Binet and the Herring intelligence tests. Dr. Herring gives the value $r = .987 \pm .002$, obtained from 116 twelve-year-old children, and Dr. Avery finds as his highest correlation $r = .824 \pm .031$, from a group of 48 first-grade children. These two correlations are independent, since they were obtained from different groups. The difference by the above formula is then $.163 \pm .031$, which is more than five times its probable error, and therefore significant. A probable explanation † of the difference between these correlations lies in the fact that one of the tests is much more reliable than the other when applied to very young children.
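
The arithmetic of this comparison can be sketched in a few lines of Python. This is only an illustration of the formula for independent measures; the function and variable names are assumptions introduced here, and the figures are those quoted above.

```python
import math

def pe_difference(pe_a, pe_b):
    """Probable error of A - B when A and B are independent."""
    return math.sqrt(pe_a ** 2 + pe_b ** 2)

# Values quoted in the text for the two correlations.
r_herring, pe_herring = 0.987, 0.002
r_avery, pe_avery = 0.824, 0.031

diff = r_herring - r_avery
pe_diff = pe_difference(pe_herring, pe_avery)
ratio = diff / pe_diff  # number of probable errors spanned by the difference

print(f"difference = {diff:.3f} ± {pe_diff:.3f}; ratio = {ratio:.1f}")
```

The larger probable error dominates the result, which is why the combined value is practically equal to Dr. Avery's .031.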

In case the measures $A$ and $B$ are correlated, the formula for testing the significance of the difference $A - B$ becomes

$$\mathrm{P.E.}_{A-B} = \sqrt{(\mathrm{P.E.}_A)^2 + (\mathrm{P.E.}_B)^2 - 2R_{AB}(\mathrm{P.E.}_A)(\mathrm{P.E.}_B)}, \qquad (99)$$
{Probable error of difference with correlated measures}

where $R_{AB}$ is the correlation between the sampling errors in $A$ and $B$.

For two means $M_1$ and $M_2$ from correlated material, the correlation between the sampled means, $R_{M_1 M_2}$, is equal to $r_{12}$, which is the correlation between the observed variables, so that

$$\mathrm{P.E.}_{M_1 - M_2} = \sqrt{(\mathrm{P.E.}_{M_1})^2 + (\mathrm{P.E.}_{M_2})^2 - 2r_{12}\,\mathrm{P.E.}_{M_1}\,\mathrm{P.E.}_{M_2}}. \qquad (100)$$
{Probable error of difference between means where correlated}

* John P. Herring, "Reliability of the Stanford and the Herring Revision of the Binet-Simon Tests," and A. T. Avery, "Comparison of Stanford and Herring Revisions Given to First-Grade Children," Journal of Educational Psychology, April, 1924.

† It is also possible that formula (94) does not apply when $r = .987$ and $N = 116$.

This formula may be illustrated by a comparison of the length of the left forearm for 1063 English males and their adult sons.* The results found were

$$M_S = 18.52'' \pm 0.021'', \quad \text{and} \quad M_F = 18.31'' \pm 0.019'',$$

while $r_{FS}$ was equal to .421, the size of forearm in father and son showing considerable correlation. Substituting in formula (100), we find that

$$\mathrm{P.E.}_{M_S - M_F} = \sqrt{(.021)^2 + (.019)^2 - 2(.421)(.021)(.019)} = .022.$$

The difference may then be written 0.21" ± .022. Since this is about nine times its probable error, there is no doubt that the sons of the professional English class were substantially differentiated from their fathers by a slightly longer forearm.
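
As a check on the substitution in formula (100), the following short Python sketch repeats the forearm calculation. The function name is an assumption made for illustration; the numbers are those given above.

```python
import math

def pe_difference_correlated(pe_a, pe_b, r_ab):
    """Probable error of A - B when the sampling errors correlate by r_ab."""
    return math.sqrt(pe_a ** 2 + pe_b ** 2 - 2.0 * r_ab * pe_a * pe_b)

mean_sons, pe_sons = 18.52, 0.021
mean_fathers, pe_fathers = 18.31, 0.019
r_father_son = 0.421

diff = mean_sons - mean_fathers
pe_diff = pe_difference_correlated(pe_sons, pe_fathers, r_father_son)
pe_if_independent = pe_difference_correlated(pe_sons, pe_fathers, 0.0)

print(f"difference = {diff:.2f} ± {pe_diff:.3f}")
print(f"ignoring the correlation would give ± {pe_if_independent:.3f}")
```

With a positive correlation the probable error of the difference is smaller than it would be for independent samples, so neglecting the correlation makes the test too conservative.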

6.  The  Probable  Errors  of  Observed  and 
Percentage  Frequencies 

In  comparing  the  frequencies  between  two  groups  it  is  often 
convenient  to  reduce  them  to  percentages  as  in  the  table  on 
page  244  taken  from  columns  1  and  2  of  Table  55. 

If $f$ denotes an observed frequency, its probable error is given by the formula

$$\mathrm{P.E.}_f = .6745\sqrt{f\left(1 - \frac{f}{N}\right)}, \qquad (101)$$
{Probable error of an observed frequency} †

while for a percentage frequency $f_p = 100f/N$ we have

$$\mathrm{P.E.}_{f_p} = .6745\sqrt{\frac{f_p(100 - f_p)}{N}}. \qquad (102)$$
{Probable error of a percentage frequency} ‡


* Biometrika, Vol. II, p. 370.

† This formula may be derived from equation (105) by setting $p = f/N$, $q = 1 - f/N$. For a complete and excellent proof see Jones, op. cit., p. 151.

‡ Derived from formula (106) by finding the P.E. of $100p$, or $f_p$.

Table 56. Frequency Percentages of I.Q.'s for Grade and High School White Children

                  Frequency Percentages
   I.Q.       Grade Schools    High Schools
  150-160          0.1
  140-150          0.6
  130-140          1.7              0.3
  120-130          4.7              3.1
  110-120         11.3             17.5
  100-110         22.5             39.3
   90-100         26.5             28.5
   80-90          17.9             10.8
   70-80          11.2              0.5
   60-70           2.6
   50-60           0.6
   40-50           0.2
   30-40           0.1
   Total         100.0            100.0

Applying formula (102) to the percentage frequencies in the interval 100 to 110, we find

$$39.3 \pm .6745\sqrt{\frac{39.3(60.7)}{389}}, \quad \text{or} \quad 39.3 \pm 1.67,$$

and

$$22.5 \pm .6745\sqrt{\frac{22.5(77.5)}{1560}}, \quad \text{or} \quad 22.5 \pm 0.71,$$

$$\mathrm{P.E.}(\text{diff.}) = \sqrt{(1.67)^2 + (0.71)^2} = 1.81.$$

The difference 39.3 - 22.5 may therefore be written 16.8 ± 1.81. We may conclude that a significantly higher percentage of high-school pupils is found in the group with I.Q.'s between 100 and 110.
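
The same steps may be written out as a brief Python sketch. The function name is an assumption for illustration. The two probable errors are rounded to two places before being combined, as is done in the worked example, so that the final figure agrees with the 1.81 above.

```python
import math

def pe_percentage(fp, n):
    """Formula (102): probable error of a percentage frequency fp out of n cases."""
    return 0.6745 * math.sqrt(fp * (100.0 - fp) / n)

pe_high = pe_percentage(39.3, 389)    # high schools, I.Q. 100 to 110
pe_grade = pe_percentage(22.5, 1560)  # grade schools, I.Q. 100 to 110

# Combine the rounded values, following the hand calculation.
pe_diff = math.sqrt(round(pe_high, 2) ** 2 + round(pe_grade, 2) ** 2)
diff = 39.3 - 22.5

print(f"{pe_high:.2f}, {pe_grade:.2f}, difference = {diff:.1f} ± {pe_diff:.2f}")
```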

Formula (101) is often useful in comparing observed with theoretical frequencies. Thus in Fig. 55 the area under the normal curve from 80 to 90 is larger than that given by the column of the histogram. In order to find the area under the curve it is necessary to express the class limits as deviates from the mean and enter a table of areas such as Holzinger's Table XI. The arithmetic will then be as follows:

$$\frac{x_1}{\sigma} = \frac{80 - 89.28}{16.61} = -.559, \qquad \frac{x_2}{\sigma} = \frac{90 - 89.28}{16.61} = +.043;$$

$$\tfrac{1}{2}\alpha = .2120,* \qquad \tfrac{1}{2}\alpha = .0172.$$

Therefore, the normal frequency is 4834(.2120 + .0172), or 1108. From formula (101) the probable error of the observed frequency 1059 is $.6745\sqrt{1059 \times .7809} = 19$. The difference 1108 - 1059 = 49 ± 19 might therefore be attributable to the fluctuations of sampling.
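
A rough check of this comparison can be made without a printed table of areas. The sketch below is an illustration only: it obtains the normal area from the error function in Python's standard library rather than from Holzinger's Table XI, so the theoretical frequency may differ from 1108 by a unit or so. The function names are assumptions introduced here.

```python
import math

def normal_cdf(z):
    """Area under the unit normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pe_frequency(f, n):
    """Formula (101): probable error of an observed frequency f out of n."""
    return 0.6745 * math.sqrt(f * (1.0 - f / n))

n, mean, sd = 4834, 89.28, 16.61
observed = 1059  # observed frequency in the interval 80 to 90

z_low = (80 - mean) / sd
z_high = (90 - mean) / sd
expected = n * (normal_cdf(z_high) - normal_cdf(z_low))
pe_obs = pe_frequency(observed, n)

print(f"normal frequency about {expected:.0f}; observed {observed} ± {pe_obs:.0f}")
print(f"difference in probable-error units: {(expected - observed) / pe_obs:.1f}")
```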


7.  The  Chi-Square  Test 

In the case of a whole frequency distribution such as for the 4834 I.Q.'s, a comparison of the observed and theoretical frequencies may be made by Pearson's Chi-Square Test. Any such distribution is to be regarded as a sample from a much larger group. The problem is then to determine whether or not the fitted curve is a sufficiently good description of the observed data within the fluctuations of sampling.

The test is made by obtaining all the differences between observed and theoretical frequencies, substituting the result in a formula, and determining by a table the probability that random sampling would give as bad a fit or worse.

If the observed frequencies are denoted by

$$f'_1, f'_2, f'_3, \ldots, f'_n$$

and the corresponding theoretical frequencies by

$$f_1, f_2, f_3, \ldots, f_n,$$

the value for $\chi^2$ may be written

$$\chi^2 = \sum_t \frac{(f'_t - f_t)^2}{f_t}. \qquad (103)$$
{The $\chi^2$ function}

* These values have been obtained from Table XI by linear interpolation; that is, when $x/\sigma = .55$, $\tfrac{1}{2}\alpha = .2088$, and when $x/\sigma = .56$, $\tfrac{1}{2}\alpha = .2123$. The value of $\tfrac{1}{2}\alpha$ for $x/\sigma = .559$ is therefore .9 of the difference .0035, added to .2088, or .2120.



Before taking up the probability test we shall next work out $\chi^2$ for the distribution of 4834 I.Q.'s fitted by a normal curve.

In determining the values of $f_t$ it will be found convenient to obtain the fractional area from the mean to the limits of the various groups, and then subtract these values successively and multiply by $N$ to give the theoretical frequencies comparable with $f'_t$. The complete arithmetic for the values of $f_t$ is shown in the following table:


Table 57. Showing the Calculation of Normal Frequencies

  Group Limits    X_t - M    (X_t - M)/σ     Area from     Area in     f_t = Last Values
      X_t                    (σ = 16.61)     X_t to M *    Groups          × 4834
      160         + 70.72      + 4.26          .5000        .0001            0.5
      150         + 60.72      + 3.66          .4999        .0010            4.8
      140         + 50.72      + 3.05          .4989        .0060           29.0
      130         + 40.72      + 2.45          .4929        .0251          121.3
      120         + 30.72      + 1.85          .4678        .0734          354.8
      110         + 20.72      + 1.25          .3944        .1522          735.7
      100         + 10.72      + 0.65          .2422        .2262         1093.5
       90         +  0.72      + 0.04          .0160        .2283 †       1103.6
       80         -  9.28      - 0.56          .2123        .1647          796.2
       70         - 19.28      - 1.16          .3770        .0838          405.1
       60         - 29.28      - 1.76          .4608        .0301          145.5
       50         - 39.28      - 2.36          .4909        .0076           36.7
       40         - 49.28      - 2.97          .4985        .0013            6.3
       30         - 59.28      - 3.57          .4998        .0002 ‡          1.0
      Total                                                1.0000         4834.0

The only point where any difficulty is likely to arise is in passing over the group containing the mean. Here the frequencies must be added to obtain the frequency of the group. A rough diagram of the normal curve will clarify the whole procedure.
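
The procedure just described may be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the text: the areas are taken from the standard library's error function instead of Holzinger's Table XI, and the deviates are not rounded to two places, so individual frequencies differ somewhat from the tabled ones (for the interval 80 to 90 the sketch gives about 1107 against the tabled 1103.6). Working with cumulative areas also removes the special handling of the group containing the mean.

```python
import math

def normal_cdf(z):
    """Area under the unit normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, mean, sd = 4834, 89.28, 16.61
limits = list(range(30, 170, 10))  # group limits 30, 40, ..., 160

# Cumulative area below each limit; successive differences give group areas.
cumulative = [normal_cdf((x - mean) / sd) for x in limits]
theoretical = {}
for lo, hi, c_lo, c_hi in zip(limits, limits[1:], cumulative, cumulative[1:]):
    theoretical[(lo, hi)] = n * (c_hi - c_lo)

below_30 = n * cumulative[0]            # tail below the lowest limit
above_160 = n * (1.0 - cumulative[-1])  # tail above the highest limit

for (lo, hi), f in sorted(theoretical.items(), reverse=True):
    print(f"{lo:>3}-{hi:<3} {f:8.1f}")
```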

In working out $\chi^2$ Professor Pearson recommends the consolidation of the small frequencies in the end groups. For the top interval we shall therefore add 0.5 and 4.8. The excess of 1.0 below 30 may also be added to the lowest group to give 7.3. Table 58 then shows the remainder of the calculation.

* By Holzinger's Table XI.    † (.2123 + .0160 = .2283.)    ‡ .0002 is below 30.

The number of frequency groups is denoted by $n'$. Entering Pearson's Table XII with $n' = 12$ and $\chi^2 = 46.6$, we find $P = .00001$. The interpretation of this result is that once in 100,000 trials we should get, in random sampling, a fit as bad as or worse than that observed, if the real distribution were represented by the normal curve fitted above. The actual fit is therefore a very bad one. Unless the value of $P$ be .2 or more, the fit cannot be regarded as good and other curves should be tried.

The importance of the $\chi^2$ test arises from the fact that it furnishes a rigorous method for determining goodness of fit.


Table 58. Showing the Calculation of $\chi^2$

    Class      Observed     Theoretical    f'_t - f_t    (f'_t - f_t)²    (f'_t - f_t)²
               Frequency     Frequency                                    -------------
                 f'_t           f_t                                            f_t
   140-160         14            5.3         +  8.7          75.69             14.3
   130-140         36           29.0         +  7.0          49.00              1.7
   120-130        103          121.3         - 18.3         334.89              2.8
   110-120        318          354.8         - 36.8        1354.24              3.8
   100-110        799          735.7         + 63.3        4006.89              5.4
    90-100       1074         1093.5         - 19.5         380.25              0.3
    80-90        1059         1103.6         - 44.6        1989.16              1.8
    70-80         868          796.2         + 71.8        5155.24              6.5
    60-70         366          405.1         - 39.1        1528.81              3.8
    50-60         163          145.5         + 17.5         306.25              2.1
    40-50          25           36.7         - 11.7         136.89              3.7
    30-40           9            7.3         +  1.7           2.89              0.4
    Total        4834         4834.0            0.0                            46.6

Mere inspection of the data is of no value except to suggest the theoretical form of the curve to be fitted. When this has been selected by guess (or by the method of Chapter XVI) the fit should be tested by a procedure similar to that shown above. Other uses of the $\chi^2$ function will be given in Chapter XIV.
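
The summation in Table 58 is easily repeated by machine. The following Python sketch simply applies formula (103) to the observed and theoretical frequencies of that table, with the end groups consolidated as described; the variable names are assumptions for illustration.

```python
# Observed and theoretical frequencies, from the class 140-160 down to 30-40.
observed = [14, 36, 103, 318, 799, 1074, 1059, 868, 366, 163, 25, 9]
theoretical = [5.3, 29.0, 121.3, 354.8, 735.7, 1093.5,
               1103.6, 796.2, 405.1, 145.5, 36.7, 7.3]

# Contribution of each group: squared difference over theoretical frequency.
contributions = [(o - t) ** 2 / t for o, t in zip(observed, theoretical)]
chi_square = sum(contributions)

print(f"number of groups n' = {len(observed)}")
print(f"chi-square = {chi_square:.1f}")
```

The largest single contribution comes from the consolidated top group, where only 5.3 cases are expected and 14 are observed.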

A very much abbreviated table for the values of $P$ is given in Table 59 on page 248 for use when $\chi^2$ and $n'$ are not large. This table has been taken from Pearson's Table XII, the computation of which was done by Mr. W. P. Elderton.

Table 59. Values of P for Testing Goodness of Fit

                                      n'
   χ²      7      8      9      10     11     12     13     14     15
    1    .986   .995   .998   .999  1.000  1.000  1.000  1.000  1.000
    2    .920   .960   .981   .991   .996   .998   .999  1.000  1.000
    3    .809   .885   .934   .964   .981   .991   .996   .998   .999
    4    .677   .780   .857   .911   .947   .970   .983   .991   .995
    5    .544   .660   .758   .834   .891   .931   .958   .975   .986
    6    .423   .540   .647   .740   .815   .873   .916   .946   .966
    7    .321   .429   .537   .637   .725   .799   .858   .902   .935
    8    .238   .333   .433   .534   .629   .713   .785   .844   .889
    9    .174   .253   .342   .437   .532   .622   .703   .773   .831
   10    .125   .189   .265   .350   .440   .530   .616   .694   .762
   11    .088   .139   .202   .276   .358   .443   .529   .611   .686
   12    .062   .101   .151   .213   .285   .363   .446   .528   .606
   13    .043   .072   .112   .163   .224   .293   .369   .448   .527
   14    .030   .051   .082   .122   .173   .233   .301   .374   .450
   15    .020   .036   .059   .091   .132   .182   .241   .307   .378
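
Entries of this kind can be reproduced from the standard closed forms for the upper tail of the $\chi^2$ distribution. The Python sketch below is offered as an illustration and is not taken from the text: it assumes that the tabled $P$ for $n'$ groups corresponds to $n' - 1$ degrees of freedom, an assumption that the spot checks against Table 59 bear out. The function name is hypothetical.

```python
import math

def chi_square_p(chi_sq, n_groups):
    """Probability of a chi-square as large as chi_sq or larger,
    taking n_groups - 1 degrees of freedom (an assumption of this sketch)."""
    df = n_groups - 1
    if df % 2 == 0:
        # Even degrees of freedom: a finite exponential series.
        half = chi_sq / 2.0
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd degrees of freedom: a normal tail plus a finite series.
    root = math.sqrt(chi_sq)
    tail = math.erfc(root / math.sqrt(2.0))  # twice the area beyond root
    density = math.exp(-chi_sq / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= chi_sq / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total

# Spot checks against entries of Table 59.
for chi_sq, n_groups in [(1, 7), (1, 8), (10, 11), (15, 12)]:
    print(chi_sq, n_groups, f"{chi_square_p(chi_sq, n_groups):.3f}")
```

For the worked example, with $\chi^2 = 46.6$ and $n' = 12$, the same function returns a probability far below .0001, in keeping with the conclusion that the fit is a very bad one.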


8.  The  Probable  Error  of  an  Observed  Proportion 

It has already been shown in Chapter XI that the mean and standard deviation of the point binomial $(q + p)^n$ are given by $np$ and $\sqrt{npq}$, respectively. In the case where we are dealing with $K$ samples of $n$ events each, the binomial becomes $K(q + p)^n$, for which the mean and standard deviation are the same as before.

Now if the proportion of successes instead of the actual number is recorded, it will be necessary to take one $n$th of the number in each sample. The mean proportion of successes will then approach $p$ and the standard deviation will be given by

$$\sigma_p = \sqrt{\frac{pq}{n}}. \qquad (104)$$
{Standard error of a proportion}

The equations for probable errors of the mean number and of the proportion of successes in a sample are therefore

$$\mathrm{P.E.}_{np} = .6745\sqrt{npq}, \qquad (105)$$

and

$$\mathrm{P.E.}_{p} = .6745\sqrt{\frac{pq}{n}}, \qquad (106)$$

respectively.
{Probable errors of the mean and of the proportion of successes}

This last formula may be illustrated by the use of some data taken from the 1920-1921 Register of The University of Chicago. The total number of students for that year may be tabulated in the following form:

                                       Men      Women      Total
  Graduate schools (group 1)          1,433     1,246     2,679 (n_1)
  Undergraduate schools (group 2)     3,938     4,768     8,706 (n_2)
  Total                               5,371     6,014    11,385

The problem is to determine whether or not the proportion of men in the graduate schools is significantly larger than in the undergraduate schools. In this case a "success" is given by the registration of a man and a "failure" by the registration of a woman, while the total for each is the size of the sample, $n$.

The observed proportion of men in the graduate schools is $\frac{1433}{2679} = .535 = p_1$, while the proportion of men undergraduates is $\frac{3938}{8706} = .452 = p_2$. It is also evident that $q_1 = .465$, $n_1 = 2679$, $q_2 = .548$, and $n_2 = 8706$. From formula (106) we therefore have

$$p_1 = .535 \pm .6745\sqrt{\frac{(.535)(.465)}{2679}} = .535 \pm .0065,$$

and

$$p_2 = .452 \pm .6745\sqrt{\frac{(.452)(.548)}{8706}} = .452 \pm .0036.$$

The difference between the two proportions may therefore be written

$$p_1 - p_2 = .083 \pm \sqrt{(.0065)^2 + (.0036)^2} = .083 \pm .0074.$$

Assuming that the observed proportions are typical of other years, or that the above data furnish random samples, we may conclude that the graduate schools enroll a significantly larger proportion of men graduates. It should be noted, however, that the conditions brought about by the war might invalidate such assumptions. The safest procedure, therefore, would be to calculate the differences for a number of years.

Another method of approach to the above problem is to determine whether or not the difference between the two proportions could have arisen merely from the fluctuations in sampling in case the two groups are regarded as samples from the same or very similar populations.

The proportion of men in both schools is given by $p_0 = \frac{5371}{11385} = .472$, with $q_0 = .528$. The equations for the probable errors of the proportions in the two samples will then be

$$\mathrm{P.E.}_{p_1} = .6745\sqrt{\frac{p_0 q_0}{n_1}}, \qquad (107a)$$

and

$$\mathrm{P.E.}_{p_2} = .6745\sqrt{\frac{p_0 q_0}{n_2}}. \qquad (107b)$$
{Probable errors of proportions of successes, based on both groups}

Applying these formulas to the above data, we have

$$\mathrm{P.E.}_{p_1} = .6745\sqrt{\frac{(.472)(.528)}{2679}} = .0065,$$

and

$$\mathrm{P.E.}_{p_2} = .6745\sqrt{\frac{(.472)(.528)}{8706}} = .0036,$$

agreeing to four places with the results found by formula (106). The difference test, of course, gives $p_1 - p_2 = .083 \pm .0074$ as before, and we may therefore safely conclude that random sampling could not have accounted for the difference between the observed proportions. The difference between the values given by formulas (106) and (107) is chiefly a theoretical one, for they do not differ largely unless $p_1$ and $p_0$ differ largely.
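
Both methods may be compared in a short Python sketch. The function and variable names are assumptions for illustration; the enrollment figures are those of the table above.

```python
import math

def pe_proportion(p, n):
    """Formula (106): probable error of a proportion p in a sample of n."""
    return 0.6745 * math.sqrt(p * (1.0 - p) / n)

men_grad, n_grad = 1433, 2679
men_under, n_under = 3938, 8706

# Formula (106): each sample furnishes its own p.
p1 = men_grad / n_grad
p2 = men_under / n_under
pe1 = pe_proportion(p1, n_grad)
pe2 = pe_proportion(p2, n_under)
pe_diff = math.sqrt(pe1 ** 2 + pe2 ** 2)

# Formulas (107a) and (107b): one p_0 from both groups together.
p0 = (men_grad + men_under) / (n_grad + n_under)
pe1_pooled = pe_proportion(p0, n_grad)
pe2_pooled = pe_proportion(p0, n_under)

print(f"p1 - p2 = {p1 - p2:.3f} ± {pe_diff:.4f}")
print(f"separate: {pe1:.4f}, {pe2:.4f}; pooled: {pe1_pooled:.4f}, {pe2_pooled:.4f}")
```

Because $pq$ changes slowly near $p = .5$, the two sets of probable errors agree to four places here.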


9.  Response  Error  Formulas 

A  number  of  formulas  for  dealing  with  the  response  error 
described  in  Chapter  V  will  next  be  obtained.  The  notation  to 
be  employed  may  be  given  as  follows : 

$z_1$ and $z_I$ = standard scores on two forms of $X_1$,
$z_2$ and $z_{II}$ = standard scores on two forms of $X_2$,
$r_{1I}$ and $r_{2II}$ = reliability coefficients of $X_1$ and $X_2$, respectively,
$e_1$ and $e_I$ = response errors in $X_1$ by two forms,
$e_2$ and $e_{II}$ = response errors in $X_2$ by two forms, and
$s$ and $t$ = average or "true" scores of an individual on $X_1$ and $X_2$, respectively.

We may therefore write

$$z_1 = s + e_1, \qquad z_I = s + e_I, \qquad z_2 = t + e_2, \qquad z_{II} = t + e_{II}. \qquad (108)$$
{Standard scores in terms of "true" scores and response error}
It  will  be  assumed  in  the  following  proofs  that  the  response 
errors,  e,  are  not  correlated  with  each  other  nor  with  the  true 
scores,  s  and  t.  While  this  assumption  is  a  reasonable  one,  it  is 
not  necessarily  valid  and  the  resulting  formulas  should  be  used 
with  caution  pending  a  verification  of  these  assumptions. 

If two forms of a test are given we may write

$$z_1 - z_I = e_1 - e_I.$$

Squaring, summing for a group of individuals, and dividing by $N$, there results

$$\frac{\sum z_1^2}{N} - \frac{2\sum z_1 z_I}{N} + \frac{\sum z_I^2}{N} = \frac{\sum e_1^2}{N} - \frac{2\sum e_1 e_I}{N} + \frac{\sum e_I^2}{N},$$

or

$$2 - 2r_{1I} = 2\sigma_{e_1}^2,$$

since $\sigma_z = 1$, $\sigma_{e_1} = \sigma_{e_I}$, and $r_{e_1 e_I} = 0$. The required response error formula therefore becomes

$$\sigma_{e_1} = \sqrt{1 - r_{1I}}, \qquad (109)$$
{Standard error of response using z scores}

or, if the original scores $X$ are used,

$$\sigma_{e_1} = \sigma_{X_1}\sqrt{1 - r_{1I}}. \qquad (110)$$
{Standard error of response using X scores}

These formulas give the standard deviation of the $N$ errors $e_1$ for a group and thus furnish an approximation to the standard deviation of many similar errors for a single individual. They measure, therefore, the standard error of response within an average individual of the group. In case the probable error is used as a unit we have

$$\mathrm{P.E.}_{e_1}\ (\text{of individual } X_1) = .6745\,\sigma_{X_1}\sqrt{1 - r_{1I}}. \qquad (111)$$
{Probable error of response for $X_1$}

As an example, consider a test with a reliability coefficient of .64 and standard deviation of 5. Substituting these values in (111), we find that $\mathrm{P.E.}_{e_1} = .6745 \times 5 \times .6 = 2.02$. A pupil's score, such as 31, may therefore be written 31 ± 2 with the interpretation that it is an even chance that his true score, assuming that there is no practice effect, will lie anywhere between 29 and 33. To facilitate calculations of this sort, values of $\sqrt{1 - r}$ have been prepared and tabled in Holzinger's Tables for Students, No. VIII.

Returning to equations (108) and (109), we may next find the response error of the difference $z_1 - z_2$ between two tests which may be quite dissimilar. The quantity required is $\sigma_{(e_1 - e_2)}$ or $\sigma_{(e_I - e_{II})}$. Since the errors $e$ are all uncorrelated,

$$\sigma_{(e_1 - e_2)}^2 = \frac{\sum e_1^2}{N} + \frac{\sum e_2^2}{N} - \frac{2\sum e_1 e_2}{N} = \sigma_{e_1}^2 + \sigma_{e_2}^2.$$

From (109) the error $\sigma_{e_1} = \sqrt{1 - r_{1I}}$, and $\sigma_{e_2} = \sqrt{1 - r_{2II}}$ by similar proof. Substituting these values in the above equation, and taking the square root of both members, we find that

$$\sigma_{(e_1 - e_2)} = \sqrt{2 - r_{1I} - r_{2II}},* \qquad (112)$$
{Standard error of response for $z_1 - z_2$}

or

$$\mathrm{P.E.}\ (\text{of individual } z_1 - z_2) = .6745\sqrt{2 - r_{1I} - r_{2II}}. \qquad (113)$$

To illustrate this formula consider two tests, say in arithmetic and spelling, given in a school grade, and assume the reliability of both tests to be .5. The difference between two standard scores, say 2.6 and 1.4, for a given pupil may therefore be written 2.6 - 1.4 = 1.2 ± .6745. Since this difference is approximately twice its probable error the chances are about four to one that the true difference lies between zero and 2.4.

* This formula was first derived by T. L. Kelley in Journal of Educational Research, September, 1923. A note on his proof is given by the writer in the January, 1925, issue of the same journal.
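
Formulas (111) and (113) lend themselves to a small Python sketch. The function names are assumptions for illustration; the two calls repeat the worked examples of this section.

```python
import math

def pe_response(sd, reliability):
    """Formula (111): probable error of response for a raw score."""
    return 0.6745 * sd * math.sqrt(1.0 - reliability)

def pe_response_difference(reliability_1, reliability_2):
    """Formula (113): probable error of a difference of two standard scores."""
    return 0.6745 * math.sqrt(2.0 - reliability_1 - reliability_2)

# Test with reliability .64 and standard deviation 5.
pe_score = pe_response(5.0, 0.64)

# Arithmetic and spelling tests, each with reliability .5.
pe_gap = pe_response_difference(0.5, 0.5)
gap = 2.6 - 1.4

print(f"score 31 ± {pe_score:.2f}")
print(f"difference {gap:.1f} ± {pe_gap:.4f}, ratio {gap / pe_gap:.1f}")
```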

An observed standard deviation will be larger than the true standard deviation because of the effect of response errors. This may be shown by writing the observed score as the sum of the true score and the response error, as in equation (108), whence

$$\sigma_{X_1}^2 = \sigma_s^2 + \sigma_{e_1}^2.$$

From equation (110), $\sigma_{e_1}^2 = \sigma_{X_1}^2 - \sigma_{X_1}^2 r_{1I}$, so that

$$\sigma_s = \sigma_{X_1}\sqrt{r_{1I}}. \qquad (114)$$
{Relation between true and observed standard errors}

It is therefore apparent that only for a perfectly reliable test will the observed and true standard deviations be equal.

Professor Spearman* has given a number of formulas for correcting correlation coefficients for response error, or "attenuation" as he calls it. One of the simplest of these may be worked out as follows:

The correlation between "true" scores is $r_{st} = \dfrac{\sum st}{N\sigma_s\sigma_t}$. But $\sum st = \sum z_1 z_2$, from equation (108), while $\sigma_s = \sqrt{r_{1I}}$ and $\sigma_t = \sqrt{r_{2II}}$. Therefore,

$$r_{st} = \frac{r_{12}}{\sqrt{r_{1I}\,r_{2II}}}, \qquad (115)$$
{Spearman's correction for attenuation}

where $r_{12}$ is the observed correlation.†

As an example of the use of formula (115), if an observed correlation is .6 and the reliability coefficients of $X_1$ and $X_2$ are both .8, the "true" correlation, with response error eliminated, will be .75.
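
A minimal Python sketch of formula (115) follows; the function name is an assumption for illustration.

```python
import math

def correct_for_attenuation(r_observed, reliability_1, reliability_2):
    """Formula (115): estimated correlation between "true" scores."""
    return r_observed / math.sqrt(reliability_1 * reliability_2)

r_true = correct_for_attenuation(0.6, 0.8, 0.8)
print(f"corrected correlation = {r_true:.2f}")
```

It may be observed that when the reliabilities are low relative to the observed correlation the corrected value can exceed 1, which is one reason for using the formula with caution.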

If $\sigma$ and $\Sigma$ denote the standard deviations on a test for two groups, and $r_{1I}$ and $R_{1I}$ the respective reliability coefficients, it is evident from formula (110) that

$$\sigma_e = \sigma\sqrt{1 - r_{1I}} \quad \text{and} \quad \sigma_E = \Sigma\sqrt{1 - R_{1I}}.$$

Assuming with Professor Kelley‡ that $\sigma_e = \sigma_E$, or that the test is "equally effective" for both groups, we find that

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R_{1I}}}{\sqrt{1 - r_{1I}}}, \qquad (116a)$$

or

$$r_{1I} = \frac{\sigma^2 - \Sigma^2(1 - R_{1I})}{\sigma^2}. \qquad (116b)$$
{Kelley's formula for adjusting reliability coefficients}

* C. Spearman, "Demonstration of Formulae for True Measurement of Correlation," American Journal of Psychology, Vol. XVIII (1907), p. 161, and "Correlation from Faulty Data," British Journal of Psychology, Vol. III (1910), p. 271.

† For other correction formulas see Yule, Introduction to Statistics, p. 213.

‡ Kelley, Statistical Method, p. 222. See also Chapter IX of the present text.

This formula has been used to adjust correlations for different ranges as illustrated by the following examples. If the reliability of a test is given by $R_{1I} = .5$ for a range with $\Sigma = 5$, what will the reliability be for a range with standard deviation of 10? From (116b) we find $r_{1I} = .875$, which shows the effect of "range of talent" upon the reliability coefficient. It should be noted, however, that for very small values formula (116) gives results of doubtful significance. Thus when $\Sigma = 5$, $\sigma = 10$, and $R_{1I} = .01$, we find $r_{1I} = .75$. That a test which is practically worthless on one range should be quite reliable on a range with twice as great variability is contrary to all experience with such measures.
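
The two examples may be verified with the following Python sketch of formula (116b); the function and parameter names are assumptions for illustration.

```python
def adjusted_reliability(reliability_known, sd_known, sd_new):
    """Formula (116b): reliability expected on a range with standard
    deviation sd_new, given the reliability on a range with sd_known."""
    error_variance = sd_known ** 2 * (1.0 - reliability_known)
    return (sd_new ** 2 - error_variance) / sd_new ** 2

r_wide = adjusted_reliability(0.5, 5.0, 10.0)
r_doubtful = adjusted_reliability(0.01, 5.0, 10.0)

print(f"R = .50 on the narrow range gives r = {r_wide:.3f} on the wide range")
print(f"R = .01 on the narrow range gives r = {r_doubtful:.4f} on the wide range")
```

The second result shows the difficulty noted above: since the error variance is held fixed, doubling the standard deviation alone guarantees a reliability of at least .75, however poor the test was on the narrower range.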

A general criticism of all the above formulas is that the assumption of uncorrelated response errors does not appear to be justified.† Such negative evidence, however, is not sufficient at present to warrant the entire abandonment of the formulas, and they are offered here for tentative use until further evidence in proof is available.

EXERCISES 

1.  Find  the  probable  errors  of  the  frequencies  at  I.Q.  80-90  given 
in  the  columns  of  Table  55.      (10.2,  4.1,  12.1,  9.4,  3.7,  19.4.    Ans.) 

2. Determine the probable errors of the following correlation coefficients: r = .162 (N = 87), r = .083 (N = 640), r = .204 (N = 49), r = - .137 (N = 210), r = .083 (N = 40). Use Holzinger's Table X.

(.070, .026, .092, .046, .106. Ans.)

* Kelley, Statistical Method, p. 222. See also Chapter IX of the present text.

† William Brown and Godfrey H. Thomson, in "Essentials of Mental Measurement" (Cambridge University Press, England, 1921), show correlation between such errors.

3. Test the significance of the differences between the means and standard deviations given in Table 55. Use Table 54.

4. The following data were obtained from four groups:

   M_1 = 104     σ_1 = 10.0     N_1 = 110
   M_2 = 101     σ_2 = 11.0     N_2 = 97
   M_3 = 102     σ_3 = 9.6      N_3 = 92
   M_4 = 103     σ_4 = 8.5      N_4 = 106

Find the probabilities that M_1 will be larger than M_2, M_3, and M_4, respectively, on the next sampling. (.98, .92, .79. Ans.)

Hint. Use Table 54.

5. In a six-month period, 454 deaths from automobiles were reported in New York and 260 in Chicago. The populations of the two cities were 5,600,000 and 2,700,000, respectively. Are "Gotham's streets safer for the pedestrian than Chicago's," as reported by a certain newspaper? (Difference in death rates is three times its P.E.)

6. The following data were taken from the President's Report of The University of Chicago, 1923-1924.

                              Men      Women      Total
   Graduate schools          2,083     1,634      3,717
   Undergraduate schools     4,215     5,425      9,640
   Total                     6,298     7,059     13,357

Find the proportion of men in the graduate and in the undergraduate schools, and test the significance of the difference found.

(p_1 - p_2 = .123 ± .0065. Ans.)

7. Fit, with a normal curve, the distribution of the Terman scores given in Exercise 3 of Chapter II, and apply the $\chi^2$ test.

(P = .6. Ans.)

8. Apply the $\chi^2$ test to the distributions of I.Q.'s fitted in Exercise 9 of Chapter XII.


CHAPTER  XIV 


FURTHER  METHODS  OF  CORRELATION  FOR 
TWO  CHARACTERS 

1.  Introductory 

The  correlation  methods  discussed  thus  far  have  been  those 
which  are  applied  to  quantitative  series  or  to  traits  which  are 
measurable  on  a  numerical  scale.  In  case  the  series  are  quali- 
tative or  unordered,  in  the  sense  used  in  the  second  chapter, 
other  methods  for  measuring  the  association  become  necessary. 
The  present  chapter  will  therefore  be  concerned  with  the  treat- 
ment of  such  series  by  suitable  methods. 

In  order  to  illustrate  the  combinations  of  series  that  may  arise, 
we  may  begin  by  listing  some  of  the  possibilities  with  short  sup- 
posititious examples.  The  table  below  illustrates  the  case  of  an 
association  for  quantitative  and  qualitative  series,  intelligence 
being  measured  on  a  numerical  scale  and  school  work  rated  in 
verbal  categories  in  orderly  progression. 

Table 60. Illustrating Associated Quantitative and Qualitative Series

   School Work                  I.Q. (Quantitative)
   (Qualitative)       80      90      100      110      120
   Good                    3       12       14       11
   Medium                  4       15       17        2
   Poor                    7        3       12        -

Table  61  on  page  257  shows  the  association  between  two 
qualitative  series,  both  characteristics  being  verbally  indexed. 

Table 61. Illustrating Associated Qualitative Series

   School Work                   Behavior (Qualitative)
   (Qualitative)       Bad     Troublesome     Good     Excellent
   Good                 3           9           12         14
   Medium               4          10           16          2
   Poor                10           2            7          -


In  both  tables  there  appears  to  be  some  association  between 
the  traits,  but  it  cannot  be  adequately  measured  by  the  product- 
moment  correlation  in  the  form  used  in  Chapter  IX,  because 
of  the  lack  of  numerical  indexes  for  the  categories. 

An  example  of  association  for  quantitative  and  unordered 
series  is  next  given  in  Table  62,  the  characteristics  being  the 
intelligence  of  children  and  the  occupation  of  their  fathers. 


Table 62. Illustrating Associated Quantitative and Unordered Series

   Occupation of Father          I.Q. of Child (Quantitative)
   (Unordered)             80      90      100      110      120
   Teacher                     7       11       12       10
   Doctor                      3        9       14        8
   Lawyer                      3        6        9       12
   Writer                      3        4        7       11

The relationship in this case cannot be observed very readily, because the arrangement of the occupation categories is a matter of indifference. A quite different method of measuring association will therefore be required for such a problem.

A complete list of the combinations of series which may arise is given as follows:

a. Quantitative with quantitative

b. Quantitative with qualitative

c. Quantitative with unordered

d. Qualitative with qualitative

e. Qualitative with unordered

f. Unordered with unordered

While  some  of  these  occur  only  rarely  in  statistical  work,  it  is 
nevertheless  desirable  to  have  suitable  methods  for  dealing  with 
each  type  of  association.  The  methods,  however,  are  by  no 
means  restricted  to  one  type  of  problem,  and  consequently  the 
choice  often  becomes  a  difficult  matter.  In  the  present  discus- 
sion we  shall  select  a  few  of  the  outstanding  methods  available 
and  apply  them  to  problems  with  suggestions  as  to  the  appro- 
priate method  to  employ  whenever  possible. 

2.  Another  Formula  for  the  Product-Moment  Method 

Before  taking  up  the  correlation  of  qualitative  series  we  shall 
first  introduce  a  modification  of  the  product-moment  formula 
convenient  for  dealing  with  such  data.  The  method  was  pre- 
sented by  Professor  Pearson  in  one  of  his  lectures  at  the  Uni- 
versity of  London. 

Using  the  notation  of  Chapter  IX,  the  product-moment 
formula  may  be  written 

(2A^.)(2/,c/,) 


^fxydxdy  — 


hk 


N 

r  =  - t; ■ y  (117) 

N(TxCry 

where the product hk occurs because the numerator is expressed in class intervals. If $\bar{Y}_x$ denotes the mean of a column and $M_y$ the mean of the whole table, it is also evident that

\[
(\bar{Y}_x - M_y)\, f_x = \left(\sum_y f_{xy} d_y - f_x \frac{\sum f_y d_y}{N}\right) k .
\]

Multiplying both members of this equation by $h d_x$, summing over the whole table, and noting that $\sum$ is merely a symbol of operation, we have

\[
\sum f_x d_x (\bar{Y}_x - M_y)\, h = \left[\sum f_{xy} d_x d_y - \frac{(\sum f_x d_x)(\sum f_y d_y)}{N}\right] hk .
\]

Substituting  this  result  in  formula  (117),  we  then  obtain 

\[
r = \frac{\sum f_x d_x (\bar{Y}_x - M_y)\, h}{N \sigma_x \sigma_y} \tag{118a}
\]

and, similarly,

\[
r = \frac{\sum f_y d_y (\bar{X}_y - M_x)\, k}{N \sigma_x \sigma_y} \tag{118b}
\]

[Pearson's formulas for the correlation coefficient based on the means of the arrays]


The  above  method  is  very  convenient  when  the  means  of  the 
arrays  are  known,  for  it  is  then  not  necessary  to  calculate  the 
quantity  ^fxydxdy  from  the  individual  cells.  It  should  be  noted 
that  the  variables  dx  and  dy  may  be  taken  from  any  origins 
whatsoever,  and  it  may  seem  a  little  curious  at  first  that  the 
values  of  formulas  (118  a)  and  (118  b)  remain  unchanged  when 
the  origins  are  shifted  and  all  quantities  except  dx  and  dy  are 
fixed  throughout  in  these  formulas. 


Table 63. Illustrating the Calculation of the Correlation Coefficient by Formula (118a)

    X       Ȳx      Ȳx - My     fx     dx    (Ȳx - My) fx dx h
  184.5    72.25    +18.50       1      5        +925.00
  174.5    52.25     -1.50       1      4         -60.00
  164.5    64.75    +11.00       4      3       +1320.00
  154.5    60.89     +7.14      11      2       +1570.80
  144.5    57.25     +3.50       9      1        +315.00
  134.5    52.70     -1.05      11      0           0.00
  124.5    45.25     -8.50       5     -1        +425.00
  114.5    42.25    -11.50       4     -2        +920.00
  104.5    37.25    -16.50       2     -3        +990.00
   94.5    37.25    -16.50       1     -4        +660.00
   84.5    32.25    -21.50       1     -5       +1075.00
  Total                         50               8140.80

My = 53.75    σx = 19.92    σy = 10.50    N σx σy = 10,458

r = 8140.80 / 10,458 = +.778
In order to illustrate the application of this method to quantitative data, the correlation problem shown in Table 30 of Chapter IX has been worked out on page 259, using formula (118a). The means of the columns Ȳx were calculated as for any distribution, and the values for dx were taken from the arbitrary origin 134.5. A check on the numerator of (118a) may be made by shifting to another origin and recalculating the sum of all the products. The proof of this check is left as an exercise.
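The arithmetic of Table 63 and the origin check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list names and the helper `numerator_118a` are hypothetical, and the figures are the rounded column means and constants printed in the table, so the check holds only up to rounding.

```python
# Values as printed in Table 63: column means of Y, column frequencies,
# and step deviations dx from the arbitrary origin 134.5 (class interval h = 10).
col_means = [72.25, 52.25, 64.75, 60.89, 57.25, 52.70, 45.25, 42.25, 37.25, 37.25, 32.25]
f_x = [1, 1, 4, 11, 9, 11, 5, 4, 2, 1, 1]
d_x = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
M_y, sigma_x, sigma_y, h = 53.75, 19.92, 10.50, 10.0


def numerator_118a(shift=0):
    """Sum of fx * dx * (Ybar_x - My) * h, with dx measured from a shifted origin."""
    return sum(f * (d + shift) * (ybar - M_y) * h
               for f, d, ybar in zip(f_x, d_x, col_means))


N = sum(f_x)
num = numerator_118a()
r = num / (N * sigma_x * sigma_y)
print(round(num, 2), round(r, 3))

# Moving the origin two classes alters the sum only through rounding in the means,
# because the weighted deviations (Ybar_x - My) add to zero.
print(round(numerator_118a(shift=2) - num, 2))
```

The second print shows how little the numerator moves when the origin is shifted, which is the check described in the text.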

3.  The  Product-Moment  Method  for  Qualitative  Series 

A qualitative series may be converted into a quantitative one by representing the data on a normal scale as shown in section 7 of Chapter XII. The various groups will then be designated by numbers instead of by verbal description, and the product-moment method may then be applied for measuring the amount of correlation.

The following table represents the correlation between the score on a physics test and the rating of the teachers for 245 high-school pupils. The combination is, therefore, a quantitative series with a qualitative one, and the latter will need to be converted to a normal scale.

Table 64. Data from a Physics Test and Teacher Rating

                        Teacher Rating
Test Score     Poor     Fair     Good     Excellent     Total
  70-80                            2          2            4
  60-70          1        6       12         18           37
  50-60         11       15       24         18           68
  40-50         19       26       23         16           84
  30-40         10       17        9          4           40
  20-30          6        2        1          1           10
  10-20                   1                                1
   0-10          1                                         1
  Total         48       67       71         59          245
  Per cent    19.6     27.3     29.0       24.1        100.0
The ordinates bounding the various pieces under the normal curve are most readily found by entering a table such as Holzinger's Table XII with the cumulative frequencies .196, .469, and .759, each less .5, or with the values -.304, -.031, and .259. The three ordinates resulting are .2766, .3977, and .3116, respectively, as illustrated in Fig. 66.

[Fig. 66. Illustrating the means of the rating categories when the series is represented on a normal scale. Areas of the four pieces: .196, .273, .290, .241; means at -1.411σ, -.444σ, .297σ, and 1.293σ.]

The means of the various pieces may now be worked out by formula (86) of Chapter XII; for example,

\[
\frac{\bar{x}_p}{\sigma_x} = \frac{0 - .2766}{.196} = -1.411,
\]

where $\bar{x}_p/\sigma_x$ is the mean of the "poor" category. For the other three means, we obtain -.444, .297, and 1.293. These numbers are to be regarded as class values in the subsequent calculations.
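For readers without a table of ordinates at hand, the same class values can be approximated with Python's standard library. This is a sketch, not the book's procedure: `category_means` is a hypothetical helper, and `statistics.NormalDist` stands in for Holzinger's Table XII, so the last digit may differ slightly from the tabled figures.

```python
from statistics import NormalDist

std = NormalDist()                       # unit normal curve
props = [0.196, 0.273, 0.290, 0.241]     # Poor, Fair, Good, Excellent


def category_means(proportions):
    """Mean of each slice of a unit normal curve: (lower ordinate - upper ordinate) / area."""
    ordinates = [0.0]                    # ordinate at minus infinity
    cum = 0.0
    for p in proportions[:-1]:
        cum += p
        ordinates.append(std.pdf(std.inv_cdf(cum)))
    ordinates.append(0.0)                # ordinate at plus infinity
    return [(ordinates[i] - ordinates[i + 1]) / p
            for i, p in enumerate(proportions)]


class_values = category_means(props)
print([round(v, 3) for v in class_values])
```

The printed values should agree with -1.411, -.444, .297, and 1.293 to about the third decimal.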

Since $M_x = 0$ for a normal distribution, the required formula may be obtained from (118b) in the form

\[
r = \frac{\sum f_y d_y \dfrac{\bar{X}_y}{\sigma_x}\, k}{N \sigma_y} \tag{119}
\]

[Correlation coefficient adapted for use with data on a normal scale]

where $\bar{X}_y/\sigma_x$ denotes the mean of a row measured from the mean of the table. The values for $\bar{X}_y/\sigma_x$ are obtained by multiplying the frequencies in each row by the class values just obtained, and dividing by the total in the row. Thus for the top and next row,

\[
\frac{\bar{X}_1}{\sigma_x} = \frac{2 \times .297 + 2 \times 1.293}{4} = +.795,
\]

\[
\frac{\bar{X}_2}{\sigma_x} = \frac{1(-1.411) + 6(-.444) + 12(.297) + 18(1.293)}{37} = +.615,
\]



and, similarly, $\bar{X}_3/\sigma_x = +.121$, $\bar{X}_4/\sigma_x = -.129$, $\bar{X}_5/\sigma_x = -.345$, $\bar{X}_6/\sigma_x = -.776$, $\bar{X}_7/\sigma_x = -.444$, and $\bar{X}_8/\sigma_x = -1.411$.

An arrangement of the computation for the product sum and σy is shown below. The work is best done with a machine.

Table 65. Illustrating the Calculation of the Correlation Coefficient for the Data in Table 64

    fy      dy      fy dy     X̄y/σx     fy dy (X̄y/σx)     fy dy²
     4       3        12       .795         9.540             36
    37       2        74       .615        45.510            148
    68       1        68       .121         8.228             68
    84       0         0      -.129         0.000              0
    40      -1       -40      -.345        13.800             40
    10      -2       -20      -.776        15.520             40
     1      -3        -3      -.444         1.332              9
     1      -4        -4     -1.411         5.644             16
   245               +87                   99.574            357

σy = 11.54    N σy = 2827.3

r = (99.574 × 10) / 2827.3 = .352
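The whole computation of formula (119) can be retraced from the cell frequencies. The sketch below makes several assumptions that should be read as such: the cell counts are those of Table 64 as they can be recovered by matching the printed margins and row means, the class values are the four rounded means found above, and all variable names are illustrative.

```python
from math import sqrt

class_values = [-1.411, -0.444, 0.297, 1.293]    # Poor, Fair, Good, Excellent
# Rows of Table 64, from the 70-80 score class down to 0-10 (assumed cell counts).
rows = [
    [0, 0, 2, 2], [1, 6, 12, 18], [11, 15, 24, 18], [19, 26, 23, 16],
    [10, 17, 9, 4], [6, 2, 1, 1], [0, 1, 0, 0], [1, 0, 0, 0],
]
d_y = [3, 2, 1, 0, -1, -2, -3, -4]   # step deviations from the 40-50 class
k = 10.0                              # class interval of the test scores

f_y = [sum(row) for row in rows]
N = sum(f_y)
# Mean rating of each row on the normal scale.
row_means = [sum(f * v for f, v in zip(row, class_values)) / fy
             for row, fy in zip(rows, f_y)]
product_sum = sum(fy * d * m for fy, d, m in zip(f_y, d_y, row_means))
mean_d = sum(fy * d for fy, d in zip(f_y, d_y)) / N
sigma_y = k * sqrt(sum(fy * d * d for fy, d in zip(f_y, d_y)) / N - mean_d ** 2)
r = product_sum * k / (N * sigma_y)
print(round(sigma_y, 2), round(r, 3))
```

Because the row means are carried unrounded here, the product sum differs slightly from the 99.574 of Table 65, but the coefficient agrees to three places.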


By plotting the means of the rows $\bar{X}_y/\sigma_x$ as shown in Fig. 67, a graphical representation of the regression is given. It will be noted that the points fall fairly closely along a straight line, so that the regression is probably to be regarded as linear. The equation of the regression line through the mean of the table is $x/\sigma_x = (r/\sigma_y)\,y$, or $x/\sigma_x = .0305\,y$. Since $M_y = 48.55$, two points for plotting are given by substituting $y = \pm 30$ in the above equation, or $x/\sigma_x = .915$ at $Y = 78.55$ and $x/\sigma_x = -.915$ at $Y = 18.55$.

[Fig. 67. Regression line for the physics data, drawn through $M_y = 48.55$.]

When both series are qualitative, the above method may be applied to the two scales; but, since certain corrections are sometimes desirable, another procedure will be shown. In calculating the correlation coefficient and other measures of association an error is introduced by grouping the material in broad categories. Professor Pearson* has devised several formulas for correcting this error, one of which may be written in the form

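\[
{}_c r_{xy} = \frac{\displaystyle\sum \frac{f_{xy}}{N}\cdot\frac{(z_s - z_{s+1})(z'_{s'} - z'_{s'+1})}{\dfrac{f_x}{N}\cdot\dfrac{f_y}{N}}}{\displaystyle\sum\frac{(z_s - z_{s+1})^2}{f_x/N}\;\sum\frac{(z'_{s'} - z'_{s'+1})^2}{f_y/N}} \tag{120}
\]

[Pearson's corrective formula for broad grouping, assuming normal distributions of the variates]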

where the z's are ordinates bounding the various pieces under the normal curve and the unprimed and primed values refer to X and Y, respectively. The use of this formula will next be illustrated by a problem which has been taken from Professor Pearson's paper cited in the footnote below.

Table 66. Pearson's Data on Intelligence and Quality of Clothing

Quality of                  Intelligence Rating
Clothing          B      C      D      E      F      G      Total
I                33     48    113    209    194     39       636
II               41    100    202    255    138     15       751
III              39     58     70     61     33      4       265
IV and V         17     13     22     10     10      1        73
Total           130    219    407    535    375     59      1725

* Karl Pearson, "On the Measurement of the Influence of Broad Categories upon Correlation," Biometrika, Vol. IX, p. 119.
By  using  X  for  intelligence  and  Y  for  quality  of  clothing,  the 
ordinates  on  the  two  scales  may  be  found  in  the  usual  way,  and 
the  quantities  needed  for  formula  (120)  may  be  worked  out 
as  shown  in  Table  67,  pp.  264  and  265.  Holzinger's  Table  XII 
has  been  used  throughout.  The  corrective  value  becomes  .315. 
Comparing  this  result  with  that  obtained  in  the  computation  by 
Professor  Pearson,  it  must  be  noted  that  his  .317  was  worked 
out  with  a  somewhat  different  corrective  formula. 
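Since Table 67 involves a good deal of arithmetic, a short program may help in checking it. The sketch below is one reading of the correction, not Pearson's own layout: each category is given its mean on a unit normal scale, the product sum of those class means is formed, and the result is divided by the squared correlations of each variable with its class value. The helper `slice_means` is hypothetical, and `statistics.NormalDist` replaces Holzinger's Table XII. The sign of the raw result depends only on the direction in which the categories are listed, so the magnitude is printed.

```python
from statistics import NormalDist

std = NormalDist()
# Table 66: rows are clothing classes I, II, III, IV-V; columns are intelligence B..G.
table = [
    [33, 48, 113, 209, 194, 39],
    [41, 100, 202, 255, 138, 15],
    [39, 58, 70, 61, 33, 4],
    [17, 13, 22, 10, 10, 1],
]


def slice_means(freqs):
    """Class means on a unit normal scale: (z_s - z_{s+1}) / (f / N)."""
    n = sum(freqs)
    z = [0.0]
    cum = 0
    for f in freqs[:-1]:
        cum += f
        z.append(std.pdf(std.inv_cdf(cum / n)))
    z.append(0.0)
    return [(z[i] - z[i + 1]) / (f / n) for i, f in enumerate(freqs)]


f_row = [sum(r) for r in table]
f_col = [sum(c) for c in zip(*table)]
N = sum(f_row)
y = slice_means(f_row)
x = slice_means(f_col)

# Mean product of the class means over all cells.
product = sum(table[i][j] * y[i] * x[j] for i in range(4) for j in range(6)) / N
# Squared correlation of each variable with its class value.
r_xc_sq = sum(f / N * m * m for f, m in zip(f_col, x))
r_yc_sq = sum(f / N * m * m for f, m in zip(f_row, y))
r_corrected = product / (r_xc_sq * r_yc_sq)
print(round(r_xc_sq, 4), round(r_yc_sq, 4), round(abs(r_corrected), 3))
```

The two squared correlations should fall close to the .9319 and .8267 quoted later from Table 67, and the corrected coefficient close to .315.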

Needless  to  say,  the  arithmetic  is  very  laborious  and  must  be 
done  on  a  calculator.  The  above  correction,  however,  is  impor- 
tant, and  formula  (120)  or  similar  forms  given  in  Pearson's  paper 
should  be  used  for  the  best  results. 

In  case  only  a  rough  approximation  to  the  correlation  is  de- 
sired, class  values  such  as  1,  2,  3,  •  •  •  may  be  assigned  to  both 
sets  of  categories,  and  the  coefficient  may  be  worked  out  by 
the  method  of  Chapter  IX.  The  student  is  urged  to  work  out 
this  value  for  the  above  problem  in  order  to  compare  results. 

4.  The  Correlation  Ratio  for  Qualitative  and 
Unordered  Series 

When  a  series  has  been  represented  on  a  normal  scale,  the  cal- 
culation of  the  correlation  ratio  becomes  very  simple.  The  work 
will  be  illustrated  by  the  problem  of  the  preceding  section. 

Since $M_x = 0$, formula (61) for the correlation ratio based on the means of the rows becomes

\[
\eta_{xy} = \frac{\sigma_{\bar{x}}}{\sigma_x} = \sqrt{\frac{\sum f_y \left(\bar{X}_y/\sigma_x\right)^2}{N}} \tag{121}
\]

[Correlation ratio adapted for use with data on a normal scale]

For the data given in Table 64 the arithmetic may be arranged as shown in Table 68. The work is very easily done in this problem because the means of the rows are already worked out in Table 65. The complete calculation is shorter than that for the correlation coefficient, since σy is not required.

Table 68. Illustrating the Calculation of the Correlation Ratio with Formula (121)

   X̄y/σx     (X̄y/σx)²      fy      fy (X̄y/σx)²
    .795       .6320         4         2.5280
    .615       .3782        37        13.9934
    .121       .0146        68         0.9928
   -.129       .0166        84         1.3944
   -.345       .1190        40         4.7600
   -.776       .6022        10         6.0220
   -.444       .1971         1         0.1971
  -1.411      1.9909         1         1.9909
   Total                   245        31.8786

31.8786 / 245 = .13012,    ∴ η = √.13012 = .361

Applying Blakeman's shorter test for linearity, we find that

\[
\sqrt{245}\,\sqrt{(.361)^2 - (.352)^2} = 1.25 < 4.05.
\]

Since 1.25 is less than one third of 4.05 and N is fairly large, the regression in this case may be regarded as sensibly linear.
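The correlation ratio and the quantity used in the linearity test are easily recomputed. The following is a minimal sketch with illustrative names, taking the row means and frequencies from Table 65 and the coefficient r = .352 as given; because η is carried unrounded, the test quantity comes out slightly below the 1.25 obtained from the rounded figures.

```python
from math import sqrt

row_means = [0.795, 0.615, 0.121, -0.129, -0.345, -0.776, -0.444, -1.411]
f_y = [4, 37, 68, 84, 40, 10, 1, 1]
N = sum(f_y)

# Formula (121): root of the weighted mean square of the row means.
eta = sqrt(sum(f * m * m for f, m in zip(f_y, row_means)) / N)

r = 0.352
# Quantity compared in Blakeman's shorter test: sqrt(N) * sqrt(eta^2 - r^2).
blakeman = sqrt(N) * sqrt(eta ** 2 - r ** 2)
print(round(eta, 3), round(blakeman, 2))
```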

If one of the associated series is quantitative or qualitative and the other unordered, one of the correlation ratios may always be found. Thus, if Y be quantitative, the ratio $\eta_{yx}$ has the form

\[
\eta_{yx} = \frac{1}{\sigma_y}\sqrt{\frac{\sum f_x (\bar{Y}_x - M_y)^2}{N}}
\]

[Correlation ratio for means of columns]

and is to be regarded as the ratio of two standard deviations, both depending upon Y only. The arrangement of the X categories is clearly a matter of indifference, since it will not affect the numerator or σy in the above expression.

An example of a qualitative and an unordered table is furnished by some data from a study by Mr. Tulchin of the Chicago Institute for Juvenile Research. A large number of children were rated by their teachers as of the "annoying," "sympathetic," or "unsympathetic" type, and also classified in five intelligence categories. Inasmuch as the three "attitude" categories do not necessarily come in any order, they furnish an unordered series. The table of frequencies appears as shown on page 268.
Table 69. Tulchin's Data on Intelligence and Attitude

                                      Attitude
Intelligence           Annoying    Unsympathetic    Sympathetic     Total
5  Very Superior           5                            219           224
4  Superior               24            12             1213          1249
3  Normal                105           103             2451          2659
2  Inferior              131           108             1021          1260
1  Very Inferior          73            82              174           329
Total                    338           305             5078          5721

Although  the  method  employed  with  this  problem  will  be  the 
same  as  that  for  the  physics  test,  the  results  will  be  worked  out 
for  the  purpose  of  further  illustration  and  for  comparison  of  the 
association  measured  by  a  later  method.  The  percentage  fre- 
quencies of  the  intelligence  distribution  are  5.8,  22.0,  46.5,  21.8, 
and  3.9,  beginning  with  the  Very  Inferior  group.  The  ordinates 
between  the  pieces  by  the  method  of  the  preceding  section  are 
therefore  .1160,  .3354,  .3224,  and  .0844  (Holzinger's  Table  XII). 
By  formula  (86),  the  means  of  the  five  pieces  under  the  marginal 
distribution  become 


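\[
\frac{\bar{y}_1}{\sigma_y} = \frac{0 - .1160}{.058} = -2.000, \qquad
\frac{\bar{y}_2}{\sigma_y} = \frac{.1160 - .3354}{.220} = -0.997,
\]

\[
\frac{\bar{y}_3}{\sigma_y} = +.028, \qquad \frac{\bar{y}_4}{\sigma_y} = +1.092, \qquad\text{and}\qquad \frac{\bar{y}_5}{\sigma_y} = +2.164.
\]

Multiplying these class values by the corresponding frequencies in the columns, the means of the three columns become

\[
\frac{\bar{y}_A}{\sigma_y} = -.7001, \qquad \frac{\bar{y}_U}{\sigma_y} = -.8383, \qquad\text{and}\qquad \frac{\bar{y}_S}{\sigma_y} = +.0987,
\]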
the  subscripts  referring  to  the  verbal  categories.  The  remainder 
of  the  computation  is  given  in  Table  70. 

Table 70. Illustrating the Calculation of the Correlation Ratio for Tulchin's Data

   ȳx/σy      (ȳx/σy)²       fx       fx (ȳx/σy)²
  +.0987      .009742       5078        49.470
  -.8383      .702747        305       214.338
  -.7001      .490140        338       165.667
   Total                    5721       429.475

429.475 / 5721 = .07507,    ∴ η_yx = √.07507 = .274

There are a number of corrections which may be applied to the correlation ratio to adjust for too coarse or too fine grouping. The correction for broad categories may be illustrated in the case of the data in Table 64 for which $\eta_{yx} = .385$. With ${}_c\eta_{yx}$ denoting the corrected ratio and $r_{xc}$ the correlation of x with its class value, Professor Pearson* has shown that

\[
{}_c\eta_{yx} = \frac{\eta_{yx}}{r_{xc}} \tag{122}
\]

[Correlation ratio corrected for broad categories]

where

\[
r_{xc} = \sqrt{\sum \frac{(z_s - z_{s+1})^2}{f_x/N}} \tag{123}
\]

[Correlation of a variable with its class value]

The computation will therefore be as follows:

Table 71. Illustrating the Calculation with Formula (122), for the Data of Table 64

   z_s - z_{s+1}          (z_s - z_{s+1})²      N/fx      (N/fx)(z_s - z_{s+1})²
   z0 - z1 = -.2766          .076508           5.1042          .390512
   z1 - z2 = -.1211          .014665           3.6567          .053626
   z2 - z3 = +.0861          .007413           3.4507          .025580
   z3 - z4 = +.3116          .097095           4.1525          .403187
   Total                                                       .872905

r_xc = √.872905 = .934,    ∴ cη_yx = .385 / .934 = .412
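The correction for broad categories reduces to a few lines once the bounding ordinates are known. In the sketch below the ordinates and column totals are those already found for the teacher ratings, the ratio .385 is taken from the text, and the variable names are illustrative only.

```python
from math import sqrt

N = 245
f_x = [48, 67, 71, 59]                            # Poor, Fair, Good, Excellent
ordinates = [0.0, 0.2766, 0.3977, 0.3116, 0.0]    # z_0 .. z_4 bounding the four pieces

# Formula (123): correlation of the rating with its class value.
r_xc_sq = sum((ordinates[s] - ordinates[s + 1]) ** 2 * N / f
              for s, f in enumerate(f_x))
r_xc = sqrt(r_xc_sq)

# Formula (122): observed ratio divided by r_xc.
eta = 0.385
eta_corrected = eta / r_xc
print(round(r_xc, 3), round(eta_corrected, 3))
```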


* Karl Pearson, "On the Measurement of the Influence of Broad Categories upon Correlation," Biometrika, Vol. IX, p. 116. See, also, Student, "The Correction to be made to the Correlation Ratio for Grouping," ibid. p. 316.

In case there is a fairly large number of categories and N is not large, a correction for fineness of grouping may become important. This adjustment is especially important in dealing with small coefficients even if N be large, as may be illustrated by an example in a paper* by the writer. The correlation ratio for breathing capacity on reaction time to sight was found to be .1404. Mr. R. A. Fisher† has proved that when we sample from material for which the actual value of $\eta$ is zero, and t is the number of arrays, then the mean value $\bar{\eta}^2$ from sample to sample will be $\dfrac{t-1}{N-1}$, where N is the size of the sample. In other words, although the true value is zero, the observed value will not be zero, owing to the grouping and to the sampling deviations which must always enter as positive quantities. In the present example N = 3373 and t = 17, so that $\bar{\eta}^2$ from this formula is .004745, the probable error of which is

\[
.6745\sqrt{\frac{2(t-1)(N-t)}{(N-1)^2(N+1)}}, \quad\text{or } .001128.
\]

The difference, $\eta^2 - \bar{\eta}^2$, may now be written as .014967 ± .001128, and we may conclude that it is extremely unlikely that the ratio found could have arisen from the fluctuations in uncorrelated material.

For breathing capacity on keenness of hearing we find, likewise, $\eta = .0840 \pm .0115$, t = 15, and $\eta^2 - \bar{\eta}^2 = .002904 \pm .001056$. In this case the observed value would appear to be significant by the usual test based on its own probable error; but when $\eta^2$ and $\bar{\eta}^2$ are compared, their difference is less than three times the probable error of $\bar{\eta}^2$, and hence the observed correlation of .0840 may be ascribed to the fluctuations in sampling. Breathing capacity and keenness of hearing are therefore uncorrelated.
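The two comparisons above can be reproduced numerically. The sketch below rests on an assumption that should be stated plainly: the probable error is taken as .6745 times the standard deviation of η² for uncorrelated material, using the variance 2(t - 1)(N - t) / ((N - 1)²(N + 1)), which reproduces the probable errors quoted in the text. The function names are illustrative.

```python
from math import sqrt


def null_eta_squared(N, t):
    """Mean of eta^2 when the true ratio is zero: (t - 1) / (N - 1)."""
    return (t - 1) / (N - 1)


def probable_error(N, t):
    """0.6745 times the s.d. of eta^2 under independence (assumed variance form)."""
    var = 2 * (t - 1) * (N - t) / ((N - 1) ** 2 * (N + 1))
    return 0.6745 * sqrt(var)


# Reaction time to sight (t = 17) and keenness of hearing (t = 15), N = 3373.
for eta, t in [(0.1404, 17), (0.0840, 15)]:
    excess = eta ** 2 - null_eta_squared(3373, t)
    pe = probable_error(3373, t)
    # Third value: does the excess exceed three probable errors?
    print(round(excess, 6), round(pe, 6), excess > 3 * pe)
```

The first ratio clears three probable errors by a wide margin; the second does not, in line with the conclusion drawn in the text.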

Corrections for coarseness of grouping may also be made in the case of the correlation coefficient. The reader is referred to Sheppard's corrections given in Chapter XVI and to a paper by Professor Pearson.‡

* Karl J. Holzinger, "On the Relation of Vital Capacity to Certain Psychical Characters," Biometrika, Vol. XVI, p. 145.

† R. A. Fisher, "The Goodness of Fit of Regression Formulas," Journal of the Royal Statistical Society, Vol. 85, p. 597.

‡ Karl Pearson, "On the Correction Necessary for the Correlation Ratio," Biometrika, Vol. XIV, p. 412.
5. Biserial r

If one of the characters in a table such as Table 72 is quantitative and the other consists merely of two qualitative categories, it is possible to find the correlation very simply by a method known as biserial r.

[Fig. 68. Illustrating biserial r, with Group 1 and Group 2 marked under the normal curve.]

In the derivation of this coefficient it is only necessary to assume that the distribution of the twofold (or dichotomous) character is normal, and that the regression in the table is linear. From Fig. 68, where the usual notation is illustrated, it appears at once that the slope of the regression line is given by

\[
r\,\frac{\sigma_y}{\sigma_x} = \frac{\bar{y}_2}{\bar{x}_2} = \frac{\bar{y}_1}{\bar{x}_1} = \frac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}.
\]

Making use of the above value, the correlation coefficient may now be written

\[
r = \frac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}\cdot\frac{\sigma_x}{\sigma_y}
\qquad\text{or}\qquad
r = \frac{(\bar{y}_2 - \bar{y}_1)/\sigma_y}{(\bar{x}_2 - \bar{x}_1)/\sigma_x}.
\]

The numerator of this last expression becomes $(\bar{Y}_2 - \bar{Y}_1)/\sigma_y$, that is, the difference between the means of the two columns, divided by the standard deviation of Y for the whole table. The quantities $\bar{x}_2/\sigma_x$ and $\bar{x}_1/\sigma_x$ are the means of the two pieces under the normal curve and are readily found by the use of formula (86). Denoting the fractional area $n_1/N$ by q and the remaining area $n_2/N$ by p, it follows from (86) that

\[
\frac{\bar{x}_2 - \bar{x}_1}{\sigma_x} = \frac{z}{p} + \frac{z}{q} = \frac{z(q + p)}{pq} = \frac{z}{pq}.
\]

The desired formula may then be written

\[
r_{\text{bis.}} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_y}\left(\frac{pq}{z}\right) \tag{124}
\]

[Biserial r]
Table 72. Rent and Health of Yearling Babies, Illustrating the Method of Biserial r

                              Health
Rent in Shillings    Not Good (1)    Good (2)       Total
      8.5                  1             1             2
      8.0
      7.5                                4             4
      7.0                                4             4
      6.5                  1            13            14
      6.0                  1            18            19
      5.5                  4            45            49
      5.0                 16            82            98
      4.5                 53           252           305
      4.0                101           303           404
      3.5                132           182           314
      3.0                 55            64           119
      2.5                 26            18            44
      2.0                  7             7            14
     Total           397 = n1      993 = n2      1390 = N

n1/N = q = .2856    n2/N = p = .7144    z = .3399
Ȳ1 = 3.7065    Ȳ2 = 4.1798    σy = .8021 (Sheppard's correction)    r = .354

The computation will next be illustrated by Table 72. The means Ȳ1 and Ȳ2 are found to be 3.7065 and 4.1798, respectively, and σy = .8021 with Sheppard's correction. Next, dividing n1 by N gives q = .2856 and dividing n2 by N gives p = .7144. Upon entering Holzinger's Table XII with p - .5 = .2144 the value for z is found to be .3399 with linear interpolation. Substituting all these values in formula (124), we find that

\[
r_{\text{bis.}} = \frac{(.4733)(.2040)}{(.8021)(.3399)} = \frac{.09655}{.2726} = .354.
\]

We  may  therefore  conclude  that  there  was  some  tendency 
for  the  good  health  of  yearling  babies  to  be  associated  with 
a  relatively  high  rent  for  the  home. 
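The biserial computation can be followed from the raw frequencies in a short program. Several points here are assumptions rather than the book's own working: the cell counts are read from Table 72 with the single pair of cases in the top classes placed at 8.5 shillings (the placement that reproduces the printed means), the empty 8.0 class is omitted, and `statistics.NormalDist` is used in place of Holzinger's Table XII.

```python
from math import sqrt
from statistics import NormalDist

rent = [8.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0]
not_good = [1, 0, 0, 1, 1, 4, 16, 53, 101, 132, 55, 26, 7]
good = [1, 4, 4, 13, 18, 45, 82, 252, 303, 182, 64, 18, 7]
h = 0.5   # class interval in shillings


def mean(freqs):
    """Mean rent for a column of frequencies."""
    return sum(f * v for f, v in zip(freqs, rent)) / sum(freqs)


n1, n2 = sum(not_good), sum(good)
N = n1 + n2
total = [a + b for a, b in zip(not_good, good)]
variance = sum(f * v * v for f, v in zip(total, rent)) / N - mean(total) ** 2
sigma_y = sqrt(variance - h * h / 12)      # Sheppard's correction
q, p = n1 / N, n2 / N
std = NormalDist()
z = std.pdf(std.inv_cdf(p))                # ordinate at the point of dichotomy
r_bis = (mean(good) - mean(not_good)) / sigma_y * p * q / z
print(round(sigma_y, 4), round(z, 4), round(r_bis, 3))
```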
The probable error for biserial r when q is not less than .05 is given approximately as

\[
\text{P.E.}_{(\text{bis. } r)} = .6745\,\frac{\dfrac{\sqrt{pq}}{z} - r^2}{\sqrt{N}} \tag{125}
\]

[Probable error of biserial r]

6.  The  Coefficient  of  Contingency 

When  both  characteristics  are  unordered  the  above  methods 
cannot  be  used,  and  we  must  resort  to  the  theory  of  probability 
in  order  to  secure  a  measure  of  association.  To  illustrate  this 
method,  which  is  known  as  contingency,  we  may  take  a  very 
simple  correlation  table  such  as  the  following,  the  numbers 
being  taken  small  for  convenience. 


          A      B      C      fy
L                1      2       3
M         2     [5]             7
N         2      4      1       7
O         3                     3
fx        7     10      3      20

For the cell marked in heavy lines (shown here in brackets), we shall have

f_xy = 5,    f_x = 10,    and    f_y = 7.

The probability that a measure will fall in a given column f_x is f_x/N (for example, 10/20), since f_x of the N equally likely occurrences are favorable. Similarly, the probability that a measure will fall in a particular row is f_y/N (for example, 7/20). If now these two events are regarded as independent, the probability for their combined occurrence is the product of the two probabilities above, or $f_x f_y / N^2$ (for example, 70/400). Out of the N measures, therefore, we should expect $N \left( f_x f_y / N^2 \right)$, or $f_x f_y / N$, to fall in a particular cell if the characters are entirely independent.
For the marked cell the observed frequency is 5 as compared with an independence frequency of 70/20, or 3.5. The difference 5 - 3.5 = 1.5, or in general $f_{xy} - f_x f_y / N$, is thus a measure of the departure of the two characters from complete independence, that is, of contingency.

Professor Pearson* has defined the mean square contingency for the whole table by the relation

\[
\phi^2 = \frac{\chi^2}{N} = \frac{1}{N}\sum \frac{\left(f_{xy} - \dfrac{f_x f_y}{N}\right)^2}{\dfrac{f_x f_y}{N}} \tag{126}
\]

[Mean square contingency function]

The $\chi^2$ function, it will be noted, is the same as that used in Chapter XIII. What is really wanted, however, is a coefficient varying between 0 and 1, and this is given by

\[
C = \sqrt{\frac{\phi^2}{1 + \phi^2}} = \sqrt{\frac{\chi^2}{N + \chi^2}} \tag{127}
\]

[Coefficient of mean square contingency]

and called by Pearson the coefficient of mean square contingency. In the paper cited in the footnote below he shows that when both of the characters are normally distributed the limiting value for C for many categories is the correlation coefficient r.

A form of (127) which is more convenient for calculation may be obtained by noting that

\[
\chi^2 = \sum \frac{f_{xy}^2}{f_x f_y / N} - 2\sum f_{xy} + \sum \frac{f_x f_y}{N} = S' - N,
\]

where S' is the squared sum and N results from the remaining terms. We may therefore write

\[
C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{S' - N}{N + S' - N}} = \sqrt{\frac{S' - N}{S'}} \tag{128a}
\]

[First computation form for contingency]

*  Karl  Pearson.  "On  the  Theory  of  Contingency  and  its  Relation  to  Association 
and  Normal  Correlation,"  Draper's  Research  Memoirs,  Biometric  Series  I,  1904. 
This is the formula recommended by Yule,* but the writer prefers the following one, obtained by setting S' = NS,

\[
C = \sqrt{\frac{S - 1}{S}} \tag{128b}
\]

[Second computation form for contingency]

where S is now $\sum \dfrac{f_{xy}^2}{f_x f_y}$.

The calculation of C is very simple. If formula (128b) is used, the observed cell frequencies $f_{xy}$ are first squared, then the products $f_x f_y$ are obtained, and the quotients $f_{xy}^2 / f_x f_y$ worked out. The sum of these last quantities gives S, which may then be substituted in the formula. For the above problem the work may be arranged as follows:

Table 73. Showing Calculation of the Contingency Coefficient C

Cell (row, column)     fxy     fxy²     fx fy     fxy² / (fx fy)
     L, B                1        1       30          .0333
     L, C                2        4        9          .4444
     M, A                2        4       49          .0816
     M, B                5       25       70          .3571
     N, A                2        4       49          .0816
     N, B                4       16       70          .2286
     N, C                1        1       21          .0476
     O, A                3        9       21          .4286
     Total                                        S = 1.7028

Row totals fy: L = 3, M = 7, N = 7, O = 3. Column totals fx: A = 7, B = 10, C = 3. N = 20.

We thus find S = 1.7028, whence

\[
C = \sqrt{\frac{.7028}{1.7028}} = \sqrt{.4127} = .64.
\]

* Yule, Introduction to Statistics, p. 65.
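The second computation form lends itself to a very short routine. The following sketch uses a hypothetical function `contingency` and applies it both to the small illustrative table and to Tulchin's frequencies from Table 69; empty cells are entered as zeros and contribute nothing to S.

```python
from math import sqrt


def contingency(table):
    """Return (S, C) with S = sum of fxy^2 / (fx * fy) and C = sqrt((S - 1) / S)."""
    f_row = [sum(r) for r in table]
    f_col = [sum(c) for c in zip(*table)]
    S = sum(f * f / (f_row[i] * f_col[j])
            for i, row in enumerate(table)
            for j, f in enumerate(row) if f)
    return S, sqrt((S - 1) / S)


# Rows L, M, N, O and columns A, B, C of the small example.
small = [[0, 1, 2], [2, 5, 0], [2, 4, 1], [3, 0, 0]]
# Table 69: intelligence classes 5 down to 1; Annoying, Unsympathetic, Sympathetic.
tulchin = [[5, 0, 219], [24, 12, 1213], [105, 103, 2451],
           [131, 108, 1021], [73, 82, 174]]

for t in (small, tulchin):
    S, C = contingency(t)
    print(round(S, 4), round(C, 3))
```

For the small table S comes out near 1.7029, a trifle above the 1.7028 obtained by adding quotients rounded to four places.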
For further illustration and in order to compare the result by this method with that by the correlation ratio, we shall also work out the contingency coefficient for the attitude and intelligence ratings given in Table 69. A table of squares is of course necessary in all such work.

Calculation of the Contingency Coefficient for Tulchin's Data

Intelligence   Attitude          fxy         fxy²          fx fy       fxy² / (fx fy)
     5         Annoying            5           25          75,712         .000330
     5         Sympathetic       219       47,961       1,137,472         .042165
     4         Annoying           24          576         422,162         .001364
     4         Unsympathetic      12          144         380,945         .000378
     4         Sympathetic      1213    1,471,369       6,342,422         .231989
     3         Annoying          105       11,025         898,742         .012267
     3         Unsympathetic     103       10,609         810,995         .013081
     3         Sympathetic      2451    6,007,401      13,502,402         .444914
     2         Annoying          131       17,161         425,880         .040295
     2         Unsympathetic     108       11,664         384,300         .030351
     2         Sympathetic      1021    1,042,441       6,398,280         .162925
     1         Annoying           73        5,329         111,202         .047922
     1         Unsympathetic      82        6,724         100,345         .067009
     1         Sympathetic       174       30,276       1,670,662         .018122

Row totals fy: 224, 1249, 2659, 1260, 329. Column totals fx: 338, 305, 5078. N = 5721.

S = 1.113112,    S - 1 = .113112

\[
C = \sqrt{\frac{.113112}{1.113112}} = .319
\]


In his text on statistics Mr. Yule* has shown that for t categories each way the contingency coefficient has a maximum value of $\sqrt{\dfrac{t-1}{t}}$ and that for such a table the largest value for C is given as follows:

* G. Yule, Introduction to Statistics, p. 66.

If t = 2, C cannot exceed .707.
If t = 3, C cannot exceed .816.
If t = 4, C cannot exceed .866.
If t = 5, C cannot exceed .894.
If t = 6, C cannot exceed .913.
If t = 7, C cannot exceed .926.
If t = 8, C cannot exceed .935.
If t = 9, C cannot exceed .943.
If t = 10, C cannot exceed .949.
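The limits just listed follow directly from the expression for the maximum, and may be regenerated as a check. The function name below is illustrative only.

```python
from math import sqrt


def max_contingency(t):
    """Largest attainable C for a table with t categories each way."""
    return sqrt((t - 1) / t)


for t in range(2, 11):
    print(t, round(max_contingency(t), 3))
```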

It  is  well  therefore  to  restrict  the  use  of  the  coefficient  of  con- 
tingency to  5  X  5  fold  or  finer  classification  whenever  possible. 
For  low  association  values,  however,  the  above  difficulty  does 
not  enter  in  any  marked  degree  and  the  contingency  method 
is  always  valuable  in  making  a  preliminary  analysis  of  a  table 
as  illustrated  by  Professor  Pearson  in  his  Tables.*  For  the 
example  on  page  276  the  method  of  contingency  is  as  good 
as  the  correlation  ratio,  and  the  two  results  found  are  in  fairly 
close  agreement. 

The correction† for broad grouping in the case of the contingency coefficient becomes

\[
{}_cC = \frac{C}{r_{xc}\, r_{yc}} \tag{129}
\]

[Correction to the contingency coefficient for broad grouping]

where $r_{xc}$ and $r_{yc}$ are given by formula (123). For the problem in Table 66 Professor Pearson finds C = .291. The values for $r_{xc}$ and $r_{yc}$ may be easily obtained from the work in Table 67, that is,

\[
r_{xc} = \sqrt{.9319} = .965 \qquad\text{and}\qquad r_{yc} = \sqrt{.8267} = .909.
\]

Substituting these results in formula (129), we find that

\[
{}_cC = \frac{.291}{.965 \times .909} = .332.
\]

This is again in close agreement with Pearson's result, ${}_cC = .334$, worked out with another corrective formula (loc. cit. p. 131).

* Pearson's Tables, p. xxxv.    † Biometrika, Vol. IX, p. 130.
Professor  Pearson  concludes  his  paper  with  the  remark  that 
for  contingency  tables  of  5  X  5  or  6  X  6  the  corrective  factors 
will  be  small,  but  for  4  X  4  or  3  x  3  tables  the  corrections  are 
important  and  should  always  be  made. 

The probable error of C is rather awkward to work out. It is given by the formula

[Formula (130), probable error of contingency coefficient: P.E.c is expressed through .6745/√N, φ², and a further cell sum ψ²; the printed expression is not legible in this copy.]

where

\[
\phi^2 = \frac{1}{N}\sum \frac{\left(f_{xy} - \dfrac{f_x f_y}{N}\right)^2}{\dfrac{f_x f_y}{N}} = S - 1.
\]

It is therefore necessary to work out ψ² by entering each cell.


7.  Correlation  from  Ranks 

When the data are ranked in order of magnitude a rough measure of the correlation is given by Spearman's formula,

\[
\rho = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)} \tag{131}
\]

[Spearman's formula based on rank differences]

where $v_x$ and $v_y$ are the ranks of the X and Y items, respectively.

The above formula may be readily obtained from the product-moment formula by setting $X = v_x$ and $Y = v_y$. By noting that the sum of the squares of the first N integers is given by N(2N + 1)(N + 1)/6, the remainder of the proof may be worked out by forming $\sum xy$, $\sigma_x$, and $\sigma_y$ and is left as an exercise for the student. It may also be shown that ρ ranges in value from -1 to 1.
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The  calculation  of  p  is  very  simple  as  shown  in  the  following 
example,  which  is  limited  to  10  cases  for  illustration.  After 
ranking  the  items  in  the  two  series  by  the  method  of  Chapter 
II,  the  computation  may  be  arranged  as  shown  in  Table  74. 

Table 74. Illustrating the Calculation of Correlation from Ranks

    X      Y     v_x    v_y    (v_x - v_y)   (v_x - v_y)^2
   171    117     2      6        -4            16
   169    153     3      1.5       1.5           2.25
   128    131     7      4         3             9
   141    105     5      7        -2             4
   106     71     9     10        -1             1
   146    130     4      5        -1             1
    87     80    10      9         1             1
   114    101     8      8         0             0
   187    153     1      1.5      -0.5           0.25
   133    132     6      3         3             9
                                  Sum           43.5

$$\rho = 1 - \frac{6 \times 43.5}{10 \times 99} = 1 - \frac{261}{990} = .74$$
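The arithmetic of Table 74 can be checked with a short program. The following is only a sketch: the function names `average_ranks` and `spearman_rho` are hypothetical, and it assumes that rank 1 goes to the largest value and that tied values share the mean of the ranks they occupy, as the entries 1.5 in the table suggest.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Return (rho, sum of squared rank differences) by formula (131)."""
    vx, vy = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(vx, vy))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2


# The ten cases of Table 74.
X = [171, 169, 128, 141, 106, 146, 87, 114, 187, 133]
Y = [117, 153, 131, 105, 71, 130, 80, 101, 153, 132]

rho, sum_d2 = spearman_rho(X, Y)
print(f"sum of squared differences = {sum_d2}, rho = {rho:.2f}")
```

The sum of squared differences and the rounded coefficient should agree with the 43.5 and .74 of the table.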


One difficulty in the use of the above formula arises from the
fact that a rectilinear form of distribution is assumed, that is,
one frequency for each rank. In order to overcome this difficulty
Professor Pearson* has given a corrective formula,

$$r = 2 \sin\left(\frac{\pi}{6}\,\rho\right)$$    {Pearson's correction to Spearman's rank coefficient}    (132)

which converts $\rho$ into r under the assumption of a normal
distribution. This correction, however, is small, amounting to .018
at most, and is usually not important because lack of normality
may introduce an error several times as large as the correction.
The student is urged to make up a short example in which the
distributions are very skewed. The correlation coefficient and
rank coefficient should be computed and the difference noted.
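The size of the correction in formula (132) can be examined numerically. The sketch below uses a hypothetical helper name, `pearson_from_rho`, and simply scans values of $\rho$ between 0 and 1 in steps of .001; it illustrates the statement that the correction never much exceeds .018 and is not a proof of it.

```python
from math import pi, sin


def pearson_from_rho(rho):
    """Formula (132): convert the rank coefficient rho into r."""
    return 2 * sin(pi * rho / 6)


# Scan rho = 0.000, 0.001, ..., 1.000 and record the largest correction r - rho.
grid = [k / 1000 for k in range(1001)]
max_correction = max(pearson_from_rho(p) - p for p in grid)
rho_at_max = max(grid, key=lambda p: pearson_from_rho(p) - p)

print(f"largest correction = {max_correction:.4f} near rho = {rho_at_max:.3f}")
print(f"rho = .74 gives r = {pearson_from_rho(0.74):.3f}")
```

For the value $\rho$ = .74 of Table 74 the corrected coefficient is only slightly larger than the rank coefficient itself.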


* Karl Pearson. Mathematical Contributions to Evolution, XVI, p. 12. Cambridge
University Press, London.

Another objection to $\rho$ appears when there are a good many
ties in rank. This difficulty is illustrated by the following series:

    X     Y    v_x   v_y
   10    30     5     3
   20    30     4     3
   30    30     3     3
   40    30     2     3
   50    30     1     3

If the value for $\rho$ be worked out it becomes .50 instead of zero
as found by the product-moment method. The above example
is, of course, extreme, but a large proportion of ties in rank will
generally be found to produce a correspondingly large error.
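The extreme case above is easily verified. In this sketch the ranks are entered exactly as given in the series, and the product-moment side is represented only by the sum of products of deviations, which is the numerator of r. That numerator is zero because Y does not vary; strictly the denominator vanishes as well, so the zero correlation of the text is taken here as the intended reading.

```python
# Ranks exactly as given in the series above (all Y values tie at rank 3).
vx = [5, 4, 3, 2, 1]
vy = [3, 3, 3, 3, 3]

n = len(vx)
sum_d2 = sum((a - b) ** 2 for a, b in zip(vx, vy))
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))  # formula (131)

# Numerator of the product-moment coefficient for the original scores.
x = [10, 20, 30, 40, 50]
y = [30, 30, 30, 30, 30]
mean_x, mean_y = sum(x) / n, sum(y) / n
sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

print(f"rho = {rho:.2f}, sum of deviation products = {sum_xy}")
```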

When the data are necessarily given in the form of ranks, and
when there are not many ties in rank (say less than one fifth of
the items), Spearman's rank formula may be conveniently used
to give a rough indication of the correlation. While the arithmetic
is simple for short series, the ranking and squaring become
laborious* beyond 50 cases. The method is, therefore, recommended
for about 20 to 40 cases. With more data the product-moment
method is theoretically better and more rapid.

The probable error of r given by formula (132) is

$$P.E._{r\ (\text{from } \rho)} = \frac{.7063\,(1 - r^2)}{\sqrt{N}}$$    {Probable error of r from formula (132)}    (133)


EXERCISES 

1.  Work  out  the  product-moment  correlation  coefficients  for  the 
problems  of  Exercise  1,  Chapter  IX,  using  the  method  illustrated  in 
section  2  of  the  present  chapter.  Assuming  that  the  means  of  the 
arrays  are  also  needed,  compare  the  total  amount  of  arithmetic  with 
that  required  by  the  use  of  the  correlation  form. 

* A useful table for the calculation of $\rho$ is given in Tables for the Rank
Difference Method. The Scott Company Laboratory, Philadelphia, 1920.

2. Work out the contingency coefficient for the following problem:

Correlation between Occupational Status of Parent and Nativity of Child

  Occupation              Nativity                 Total
                  1      2      3      4      5
      A           0      8      8     19     15      50
      B           4     28     15     51     17     115
      C          18    133     26     65     43     285
      D          11     73     19     35     12     150
    Total        33    242     68    170     87     600

Key: A = Professional Class
     B = Merchant
     C = Skilled Labor
     D = Unskilled Labor

     1 = Child born in United States
     2 = One parent born in United States
     3 = Both parents born in United States
     4 = One grandparent born in United States
     5 = Both grandparents born in United States

(C = .294. Ans.)


3. Compute the coefficient of contingency for the following table:

Correlation between Nativity and Mental Level of Child

  Nativity              Mental Category                  Total
               F      B      D      N      S     VS
      4        0      0      0      5     10      4        19
      3        0      0      3     10     12      3        28
      2        4      5     24     52     17      5       107
      1        2      5      9     31      5      0        52
    Total      6     10     36     98     44     12       206

Key: F = Feeble-minded
     B = Border-line
     D = Dull
     N = Normal
     S = Superior
     VS = Very Superior

(C = .434. Ans.)


Note.  The  data  for  Exercises  2  and  3  were  furnished  by  Mrs.  Irene  Lange. 
4. Work out the correlation from ranks for the Otis and Terman
test scores of Exercise 1, Chapter II. ($\rho$ = .7165. Ans.)

Compare  the  amount  of  arithmetic  with  that  in  the  product- 
moment  method. 

5.  Do  the  exercise  suggested  at  the  bottom  of  page  278. 

6. Compute $\eta_{xy}$ for the following table.

   Health          Nutrition           Total
               C       B       A
   V.R.        4      11       5         20
   R.         56      56      12        124
   N.         50      49      16        115
   R.D.      140     168      37        345
   D.         94      89      16        199
   V.D.        6       5       1         12
   Total     350     378      87        815

($\eta_{xy}$ = .096. Ans.)

7. Work out C for the data of Exercise 6. (C = .119. Ans.)

8. Compute r for the data of Exercise 6, using formula (120).
(r = .0768. Ans.)


CHAPTER  XV 

PARTIAL  AND  MULTIPLE   CORRELATION 

1.  The  Meaning  of  Partial  Correlation 

In  dealing  with  correlation  thus  far  the  relationship  of  only 
the  two  associated  characters  has  been  considered.  Each  of 
these,  however,  is  dependent  upon  many  other  factors  which 
may  influence  the  observed  correlation  to  a  considerable  extent. 
The  problem  of  partial  correlation  is  to  find  the  relationship 
between  two  variables  when  the  influence  of  other  variables 
has  been  eliminated  or  when  such  factors  have  been  held 
constant. 

The conditioning factors may be eliminated by experimental
procedure or by the use of a formula as illustrated by the following
example. The factors considered are mental age, chronological
age, and ossification ratio, the latter being an index of
anatomical development based on measurements of the wrist
bones. The problem is to discover the relationship between
mental and physical development when the influence of age
has been eliminated.

Data for the experimental solution of this problem were furnished
by records of the Laboratory Schools of The University
of Chicago, the work being done by Miss Ethel Abernethy*
and others. The children were all measured within a few days
of each birthday. In the table given on page 284 it will be noted
that not one of these coefficients is significant in comparison
with its probable error. We may therefore conclude that for
children of the same age, carpal development and mental age
are entirely unrelated.

*  Ethel  M.  Abernethy,  "Correlation  in  Physical  and  Mental  Growth,"  Journal 
of  Educational  Psychology,  October  and  November,  1925. 
Table 75. Correlation of Mental Age and Ossification Ratio (Girls)

  Chronological Age    Number of Cases    Correlation Coefficient
        6-12                120               +.016 ± .062
        13                   44               -.137 ± .100
        14                   62               -.139 ± .084
        15                   29               -.174 ± .122
        16                   45               -.022 ± .101
        17                   37               +.041 ± .111

Turning  now  to  the  method  of  partial  correlation,  we  may 
designate  the  three  variables  as  follows : 

1  =  ossification  ratio, 

2  =  mental  age, 

3  =  chronological  age. 

The correlation between 1 and 2 for 3 fixed is required and is
given by the formula

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$    {Partial correlation coefficient for three variables}    (134)


By taking several hundred cases ranging in age from 5 to
20 years, the three necessary correlations were found to be
$r_{12}$ = .75, $r_{13}$ = .87, and $r_{23}$ = .83 (girls). When these values
are substituted in the above formula we find that

$$r_{12.3} = \frac{.75 - .87 \times .83}{\sqrt{[1 - (.87)^2][1 - (.83)^2]}} = .101.$$ *


For 320 cases the probable error of this result is .037, and for
360 boys we find, similarly, $r_{12.3}$ = .089 ± .035. Neither of these
coefficients is three times its probable error, so they are to be
regarded as insignificant. The above method then gives results
in entire agreement with the experimental procedure of
Miss Abernethy. It should be noted that the original correlation
of .75 between mental age and ossification ratio is thus
due entirely to the correlation of each of these variables with
chronological age.

* When a calculating machine is available Miner's Tables for $\sqrt{1 - r^2}$ (Johns
Hopkins Press) are most convenient. For logarithmic calculation using Holzinger's
Tables, VII, see section 2.
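The substitution in formula (134) may be sketched in a few lines. The function names below are hypothetical, and the probable error is computed from $.6745(1 - r^2)/\sqrt{N}$, which is assumed here to be the formula intended, since it reproduces the values .037 and .035 quoted above.

```python
from math import sqrt


def partial_r(r12, r13, r23):
    """Formula (134): correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def probable_error_r(r, n):
    """Assumed probable error of a correlation coefficient: .6745 (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


# Girls: 1 = ossification ratio, 2 = mental age, 3 = chronological age.
r_girls = partial_r(0.75, 0.87, 0.83)
pe_girls = probable_error_r(r_girls, 320)

# Boys: only the final coefficient .089 is quoted in the text.
pe_boys = probable_error_r(0.089, 360)

print(f"girls: r12.3 = {r_girls:.3f} ± {pe_girls:.3f}")
print(f"boys:  r12.3 = 0.089 ± {pe_boys:.3f}")
print("girls' coefficient reaches three probable errors:", r_girls >= 3 * pe_girls)
```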

In Chapter IX, section 8, it was shown that some selection
lowers the correlation between traits. Thus if a narrow age
range were used we should expect altered correlations between
ossification ratio and age and between mental and chronological
age, with a resulting lower correlation between the physical
and mental traits. By restricting the range of age to zero, we
reach rigorous selection the effect of which has been noted
above. Partial correlation, then, may be regarded as a method
for obtaining relationships under rigorous selection of certain
conditioning variables.

While it is usually best to isolate factors experimentally it
is often not advisable to do so because of the great reduction
in the number of cases. The chief factors to be controlled in
the above laboratory data are age, sex, and race. If all these
are eliminated by selecting the cases, groups of 8 to 15 result,
and correlations based on such small numbers are almost
worthless. The method of partial correlation makes it possible
to use a much larger body of data, eliminating the conditioning
factors by means of formulas. It is therefore a very useful and
powerful tool in analyzing the relationships in a set of correlated
variables.

Partial correlations may be worked out for any number of
variables, but the arithmetic beyond four variables becomes
very lengthy and tedious. Of the various methods of computation,
solution by logarithms is probably best for students who do
not have the use of a calculating machine. In the next section
we shall therefore give examples of three-variable and four-variable
correlations, using logarithms and straight arithmetical
substitution.

One important caution to be observed at the outset is to use
the method of partial correlation only in case the tables from
which the original coefficients are obtained are sensibly linear.
The procedure is then to find the product-moment correlations
for all the variables studied, test the tables for linearity by
the method of Chapter X, and then substitute the coefficients
obtained in suitable formulas, provided the regressions are all
sufficiently linear. In case non-linear relationships are found
other methods must be employed, such as the procedure described
in the last section of Chapter X.

2.  Partial  Correlation  for  Three  and  Four  Variables 

In dealing with several variables it becomes necessary to use
a suitable notation for the various coefficients which arise. If
the variables are designated as $X_1, X_2, X_3, \ldots, X_n$, the original
correlations $r_{12}, r_{13}, \ldots, r_{23}, r_{24}, \ldots, r_{(n-1)n}$ are known as
coefficients of zero-order, and the subscripts are called primary
subscripts.

Correlations such as $r_{12.3}$, $r_{23.1}$, and $r_{13.2}$ are regarded as
coefficients of the first-order, while the correlations $r_{12.34}$, $r_{23.14}$,
and $r_{34.12}$ are said to be coefficients of the second-order, and so
on. The subscripts following the decimal point are known as
secondary subscripts.

The general formula for the partial correlation of the order
(n - 2) for n variables is given by

$$r_{12.34 \cdots n} = \frac{r_{12.34 \cdots (n-1)} - r_{1n.34 \cdots (n-1)}\, r_{2n.34 \cdots (n-1)}}{\sqrt{[1 - r_{1n.34 \cdots (n-1)}^2][1 - r_{2n.34 \cdots (n-1)}^2]}}$$    (135)

{Partial correlation coefficient of the order (n - 2)}

This gives the correlation between variables $X_1$ and $X_2$ when
the remaining n - 2 variables have been held constant.

Yule* has shown that the order of the secondary subscripts
is indifferent, so that $r_{12.34} = r_{12.43}$, and $r_{12.345} = r_{12.354} = r_{12.543}$,
etc. These alternative formulas, as we shall see, furnish very
useful checks on the arithmetic, since they give independent
solutions for the various partial coefficients.

* Yule, Introduction to Statistics, chap. xii. Charles Griffin & Co., London, 19124.
Using formula (135), we may now write down all the possible
correlations from three variables. These are evidently

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{[1 - r_{13}^2][1 - r_{23}^2]}},$$    (136a)

$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{[1 - r_{12}^2][1 - r_{23}^2]}},$$    (136b)

and

$$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{[1 - r_{12}^2][1 - r_{13}^2]}}.$$    (136c)

{Partial correlations of first-order}

Similarly, in the case of four variables we shall have

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\, r_{24.3}}{\sqrt{[1 - r_{14.3}^2][1 - r_{24.3}^2]}},$$    (137a)

$$r_{12.43} = \frac{r_{12.4} - r_{13.4}\, r_{23.4}}{\sqrt{[1 - r_{13.4}^2][1 - r_{23.4}^2]}},$$    (137b)

$$r_{13.24} = \frac{r_{13.2} - r_{14.2}\, r_{34.2}}{\sqrt{[1 - r_{14.2}^2][1 - r_{34.2}^2]}},$$    (137c)

and

$$r_{13.42} = \frac{r_{13.4} - r_{12.4}\, r_{23.4}}{\sqrt{[1 - r_{12.4}^2][1 - r_{23.4}^2]}},$$    (137d)

etc. {Partial correlations of second-order}

Since the two primary subscripts may be selected from four in
$_4C_2 = 6$ ways, there are evidently six possible partial correlations
of the second-order with four variables. Each of these six may
be obtained in two ways as a check, for example, $r_{12.34} = r_{12.43}$.
The total number of arrangements of the subscripts for four
variables is therefore twelve. The student should write all these
out in full in order to become familiar with the formula and
with the notation employed.

As an illustrative problem we shall take some results found
by Mr. Cyril Burt. The variables considered may be defined
as shown in the list on the following page.

$X_1$ = mental age on an English revision of the Binet scale,

$X_2$ = school attainment expressed in educational age,

$X_3$ = intellectual development as measured in age units by
Burt's reasoning test,

$X_4$ = chronological age.

The observed correlations of zero-order may be arranged as
follows:

Table 76. Burt's* Intercorrelations

           X1      X2      X3
   X2     .91
   X3     .84     .75
   X4     .83     .87     .70

Burt does not give the tables on which these correlations are
based, but we shall assume they are linear and proceed to the
calculation of the partial coefficients.

The total number of different correlations of first-order is
evidently twelve, since two variables may be selected from four
in six ways, each pair furnishing two correlations on account of
the interchangeability of the secondary subscripts.

In working out these values it is best to arrange the calculation
as in Table 77 so as to identify each step in the computation.
In the following work the logarithms of $\sqrt{1 - r^2}$ were
taken from Holzinger's Table VII and rounded off to four places.
Products such as .84 × .83 = .6972 have also been rounded off
to three figures (.697), and four-place logarithms used for the
remainder of the computation. Greater accuracy than this is
unnecessary when the original coefficients are correct to only
two places (see Chapter V).

The first item in column (2) is obtained by forming the
product .84 × .75 = .630, that is, the product of the coefficients
in the first group of three not in line with .91; next, .910 - .630
= .280 gives the first entry in column (3); the logarithm of .280 is
9.4472 - 10 as shown in column (4). This completes the calculation
up to the logarithm of the numerator. The logarithm of the
denominator of $r_{12.3}$ is now obtained by adding the logarithms
(from Holzinger's Table VII) of $\sqrt{1 - r_{13}^2}$ and $\sqrt{1 - r_{23}^2}$, which
are listed in column (1). The first entry in column (5) is then
found by adding 9.7345 - 10 and 9.8205 - 10, giving 9.5550 - 10.
To complete the calculation for $r_{12.3}$ it is only necessary to subtract
the logarithm of the denominator from the logarithm of
the numerator (9.8922 - 10 in column (6)), and look up the
corresponding number, or anti-logarithm (.780 in column (7)). The
remaining correlations are calculated in a similar way.

* Cyril Burt, Mental and Scholastic Tests, p. 182. King and Son, Ltd., London, 1921.

In finding the coefficients of second-order the first-order
values just found may again be arranged in convenient groups
of three, and the same scheme of calculation carried out, as
illustrated in Table 78. A complete check on the arithmetic
is given by two solutions for each second-order coefficient with
formulas such as (137). Each of the six second-order values is
thus worked out twice, as shown in the table on page 291.

With zero-order coefficients correct to only two places no
greater accuracy can be expected in the higher-order coefficients,
but three-place values have been used in Table 78, so that the
final results may be rounded off to two places.
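With a machine at hand the whole scheme of first-order and second-order coefficients can be produced by applying formula (135) repeatedly. The sketch below is one possible arrangement and not the tabular method of the text: the dictionary `R` and the functions `zero_order` and `partial` are hypothetical names, and the values are Burt's zero-order coefficients for the four variables. Full machine precision is carried, so third-place figures may differ slightly from those obtained with four-place logarithms.

```python
from math import sqrt

# Burt's zero-order correlations, keyed by the pair of primary subscripts.
R = {(1, 2): 0.91, (1, 3): 0.84, (2, 3): 0.75,
     (1, 4): 0.83, (2, 4): 0.87, (3, 4): 0.70}


def zero_order(i, j):
    """Look up r_ij regardless of the order in which i and j are given."""
    return R[(min(i, j), max(i, j))]


def partial(i, j, held=()):
    """Correlation of X_i and X_j with the variables listed in `held` held constant."""
    if not held:
        return zero_order(i, j)
    *rest, k = held  # eliminate the last-named variable, as in formula (135)
    r_ij = partial(i, j, rest)
    r_ik = partial(i, k, rest)
    r_jk = partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


r12_3 = partial(1, 2, (3,))
r12_34 = partial(1, 2, (3, 4))   # route of formula (137a)
r12_43 = partial(1, 2, (4, 3))   # route of formula (137b), used as a check
r23_14 = partial(2, 3, (1, 4))

print(f"r12.3  = {r12_3:.3f}")
print(f"r12.34 = {r12_34:.3f}   r12.43 = {r12_43:.3f}")
print(f"r23.14 = {r23_14:.3f}")
```

The two routes to $r_{12.34}$ agree, which is the check on the arithmetic described above.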

The interpretation of coefficients such as those found is
rendered difficult because of the fact that the first three variables
are all measures of the same thing to a certain extent,
and holding one or more of them constant gives a result of
doubtful meaning. The variables $X_1$ and $X_3$ are both measures
of intelligence, but $r_{12.34}$ = +.61 while $r_{23.14}$ = -.08, the latter
coefficient being negligible. Burt interprets the coefficient .61
as follows:

"With both age and 'intelligence' (reasoning ability) constant,
the partial correlation between school attainments and
Binet results remains at .61 . . . . There can, therefore, be little
doubt that with the Binet-Simon scale a child's mental age is
a measure not only of the amount of intelligence with which
he is congenitally endowed . . . it is also an index, largely if not
mainly, of the mass of scholastic information and skill . . . which
he has accumulated in school." (Op. cit. p. 182)

The coefficient $r_{23.14}$ = -.08, on the other hand, would seem
to show that for children of given chronological and mental
ages, reasoning ability (or "intelligence" as Burt calls it) is
entirely unrelated to scholastic achievement. Burt and others
have claimed that this result shows the reasoning test to be a
pure measure of intelligence "independent of schooling." If
mental age, however, is a measure of both "intelligence" and
achievement, the partial correlation above will necessarily be
low because, by fixing $X_1$, the variables $X_2$ and $X_3$ are thereby
both restricted. It may also be noted that "schooling" as used
here is a measure of relative achievement in school. The fact
that the Binet test has higher correlation with such achievement
than does Burt's test indicates that the former is the
better guide in predicting scholastic success and is therefore a
better intelligence test for practical purposes.

3.  Partial  Regression  Equations  for  Three  Variables 

When two variables, $X_1$ and $X_2$, are involved it has been
shown in Chapter IX that the equation for predicting the most
probable value of $X_1$ for a given value of $X_2$ is given by the
regression equation

$$\bar{X}_1 = r_{12}\frac{\sigma_1}{\sigma_2} X_2 + \text{constant} = b_{12} X_2 + \text{constant}.$$    (138)

{Regression equation for two variables}

This same method of prediction will now be applied to several
variables, $X_1, X_2, X_3, \ldots, X_n$. The regression equation for
estimating $X_1$ from the remaining n - 1 variables is

$$\bar{X}_1 = b_{12.34 \cdots n} X_2 + b_{13.24 \cdots n} X_3 + \cdots + b_{1n.23 \cdots (n-1)} X_n + C,$$    (139)

which is known as a linear function of the X's. The quantities
$b_{12.34 \cdots n}$, $b_{13.24 \cdots n}$, ..., $b_{1n.23 \cdots (n-1)}$ and C are constants to be
so chosen that the sum of the squared differences between the observed
values $X_1$ and the predicted values $\bar{X}_1$ shall be as small as possible;
that is, so that $\sum (X_1 - \bar{X}_1)^2$ = a minimum. These differences,
$X_1 - \bar{X}_1$, are also known as errors of estimate or residuals.
By applying the method of least squares in a manner similar
to that in Chapter IX, it may be shown that

$$b_{12.34 \cdots n} = r_{12.34 \cdots n}\,\frac{\sigma_{1.34 \cdots n}}{\sigma_{2.34 \cdots n}}$$    {Regression coefficient of the order (n - 2)}    (140)

where

$$\sigma_{1.23 \cdots n} = \sigma_1 \sqrt{(1 - r_{12}^2)(1 - r_{13.2}^2) \cdots (1 - r_{1n.23 \cdots (n-1)}^2)}.$$    (141)

{Standard deviation of the order (n - 1)}

The probable error of estimate is of course given by

$$P.E._{est} = .6745\,\sigma_{1.23 \cdots n}.$$    (142)

{Probable error of estimate}

The value furnished by (140) is known as a partial regression
coefficient. It gives the average change in the dependent variable
(left member of the equation) for a unit change in the
variable to which it is attached, when all the remaining variables
are kept constant.

It will be noted that the subscripts are so arranged that the
position of any regression coefficient is uniquely determined.
Thus for $b_{13.24 \cdots n}$ the primary subscript 1 indicates that $X_1$ is
the dependent variable, while 3 shows the variable $X_3$ to which
the coefficient is attached. The remaining secondary subscripts
following the decimal point merely show the number of variables
involved, and their arrangement is a matter of indifference.

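In order to illustrate the above formulas we shall next write
out in full the equations for three variables. From equations
(140) and (141) we find that

$$\bar{X}_1 = b_{12.3} X_2 + b_{13.2} X_3 + C_1$$

$$= r_{12.3}\frac{\sigma_{1.3}}{\sigma_{2.3}} X_2 + r_{13.2}\frac{\sigma_{1.2}}{\sigma_{3.2}} X_3 + C_1$$

$$= r_{12.3}\frac{\sigma_1 \sqrt{1 - r_{13}^2}}{\sigma_2 \sqrt{1 - r_{23}^2}} X_2 + r_{13.2}\frac{\sigma_1 \sqrt{1 - r_{12}^2}}{\sigma_3 \sqrt{1 - r_{23}^2}} X_3 + C_1.$$

Upon substituting the values of $r_{12.3}$ and $r_{13.2}$, we obtain

$$\bar{X}_1 = \frac{\sigma_1}{\sigma_2}\,\frac{(r_{12} - r_{13} r_{23})}{1 - r_{23}^2} X_2 + \frac{\sigma_1}{\sigma_3}\,\frac{(r_{13} - r_{12} r_{23})}{1 - r_{23}^2} X_3 + C_1,$$    (143)

and, similarly,

$$\bar{X}_2 = \frac{\sigma_2}{\sigma_1}\,\frac{(r_{12} - r_{13} r_{23})}{1 - r_{13}^2} X_1 + \frac{\sigma_2}{\sigma_3}\,\frac{(r_{23} - r_{12} r_{13})}{1 - r_{13}^2} X_3 + C_2,$$    (144)

and

$$\bar{X}_3 = \frac{\sigma_3}{\sigma_1}\,\frac{(r_{13} - r_{12} r_{23})}{1 - r_{12}^2} X_1 + \frac{\sigma_3}{\sigma_2}\,\frac{(r_{23} - r_{12} r_{13})}{1 - r_{12}^2} X_2 + C_3.$$    (145)

{Regression equations for three variables, short form}

The general expression for C is given by

$$C = M_1 - b_{12.34 \cdots n} M_2 - b_{13.24 \cdots n} M_3 - \cdots - b_{1n.23 \cdots (n-1)} M_n.$$    (146)

{Constant term in regression equation}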

When an estimate has been made with a regression equation
it is necessary to know something about the reliability of the
obtained prediction. The standard error of estimate for predicting
$X_1$ from n - 1 other variables is the standard deviation
of the residuals given by equation (141) and is interpreted in
the same way as $\sigma\sqrt{1 - r^2}$ of Chapter IX. For the above
equations we thus find

$$\sigma_{1.23} = \sigma_1 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2} = \sigma_1 \sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{12.3}^2},$$

or

$$\sigma_{1.23} = \sigma_1 \sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} = \frac{\sigma_1 \sqrt{S_{123}}}{\sqrt{1 - r_{23}^2}},$$    (147)

$$\sigma_{2.13} = \sigma_2 \sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}{1 - r_{13}^2}} = \frac{\sigma_2 \sqrt{S_{123}}}{\sqrt{1 - r_{13}^2}},$$    (148)

$$\sigma_{3.12} = \sigma_3 \sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}} = \frac{\sigma_3 \sqrt{S_{123}}}{\sqrt{1 - r_{12}^2}},$$    (149)

{Standard deviations of the second-order}

where $S_{123}$ is the sum of the terms in the numerator of (147).
The probable errors of estimate are of course obtained by multiplying
the above values by .6745.
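That the several forms given for $\sigma_{1.23}$ are equivalent may be confirmed numerically. The sketch below borrows the zero-order values of the college-marks example worked later in this section ($\sigma_1$ = 10.21, $r_{12}$ = .666, $r_{13}$ = .750, $r_{23}$ = .628); the variable names are hypothetical.

```python
from math import sqrt

s1 = 10.21
r12, r13, r23 = 0.666, 0.750, 0.628

# First-order partial correlations by formulas (136a) and (136b).
r12_3 = (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
r13_2 = (r13 - r12 * r23) / sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

# The two product forms of sigma_1.23.
form_a = s1 * sqrt(1 - r12 ** 2) * sqrt(1 - r13_2 ** 2)
form_b = s1 * sqrt(1 - r13 ** 2) * sqrt(1 - r12_3 ** 2)

# The zero-order form, formula (147).
S123 = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
form_c = s1 * sqrt(S123) / sqrt(1 - r23 ** 2)

print(f"S123 = {S123:.6f}")
print(f"sigma_1.23 = {form_a:.4f}, {form_b:.4f}, {form_c:.4f}")
print(f"probable error of estimate = {0.6745 * form_c:.2f}")
```

All three forms give the same standard deviation, so that whichever set of coefficients is already at hand may be used.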
In dealing with a three-variable problem for which only zero-order
coefficients are required, formulas (143), (144), and (145)
will be found convenient. When the partial correlations are
available, however, formulas like (139), (140), and (141) will be
found much simpler to employ. The computation for the former
equations will be illustrated by an example in which success in
first-year college work is predicted by the average of four years'
work in high school and an intelligence test. The three variables
and zero-order coefficients obtained from a sample of 75 cases
may be given as follows:

$X_1$ = criterion of success = average mark for first-year college work.

$X_2$ = predictor = average mark from four years in high school.

$X_3$ = predictor = score on the Brown Intelligence Test.

   M1 = 78.0 %,       σ1 = 10.21 %,       r12 = .666;
   M2 = 87.2 %,       σ2 = 6.02 %,        r13 = .750;
   M3 = 32.8 pts.,    σ3 = 10.35 pts.,    r23 = .628.

It should be noted that the number of cases (N = 75) is too
small to give very reliable results, but the above example will
be used to illustrate the calculation.

By Blakeman's test the correlations all proved to be linear,
so that the method of partial correlation is justified in this
problem.

The equation required is (143), or

$$\bar{X}_1 = \frac{\sigma_1}{\sigma_2}\,\frac{(r_{12} - r_{13} r_{23})}{1 - r_{23}^2} X_2 + \frac{\sigma_1}{\sigma_3}\,\frac{(r_{13} - r_{12} r_{23})}{1 - r_{23}^2} X_3 + C_1,$$

the computation for which may be arranged as in the table on
page 296.

Inasmuch as the zero-order values are given to three and four
significant figures, a five-place logarithm table has been used.
The logarithms of $1 - r^2$ are given directly by Holzinger's
Table VI. It will be found necessary to observe the arrangement
of the quantities in the formula very carefully in order to
combine the proper logarithms.
Table 79. Showing Calculation of First-Order Regression Coefficients

         (1)      (2)        (3)          (4)              (5)       (6)        (7)
          r     Product   Difference   Log (r - rr)         σ       Log σ    Log (1 - r²)
                  rr       r - rr
   12   .666    .4710      .1950        9.29003      1    10.21    1.00903
   13   .750    .4182      .3318        9.52088      2     6.02    0.77960
   23   .628                                         3    10.35    1.01494     9.78220

              (8)                  (9)                   (10)               (11)
         Log Numerator       Log Denominator        Log Coefficient     Coefficient of
       [Cols. (4) and (6)]  [Cols. (6) and (7)]   [Cols. (8) and (9)]     Regression,
                                                                          First-Order
   b12.3     0.29906              0.56180               9.73726             .5461
   b13.2     0.52991              0.79714               9.73277             .5405

Using a calculator and Miner's Table for $1 - r^2$, the above
computation becomes very much easier:

$$b_{12.3} = \frac{10.21 \times .195}{6.02 \times .605616} = \frac{1.99095}{3.6458} = .5461$$

and

$$b_{13.2} = \frac{10.21 \times .3318}{10.35 \times .605616} = \frac{3.38768}{6.2681} = .5405.$$

The value of C as given by (146) becomes

$$C = 78 - .5461 \times 87.2 - .5405 \times 32.8 = 12.65, \quad \therefore\ 12.65\%,$$

the unit being the same as for $X_1$.

Using formula (147), the probable error of estimate may also
be worked out from the zero-order coefficients:

$$S_{123} = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23} = .226932.$$

   log √S123            = 9.67795
   log σ1               = 1.00903
   log .6745            = 9.82898
   log prod.            = 0.51596
   log √(1 - r23²)      = 9.89110
   log .6745 σ1.23      = 0.62486

   ∴ P.E.1.23 = 4.22%
The complete regression equation therefore becomes

$$\bar{X}_1 = .546\,X_2 + .540\,X_3 + 12.65\% \pm 4.22\%.$$
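The constants of this equation may also be obtained without logarithms. The following sketch applies formulas (143), (146), and (147) directly to the zero-order values of the example; the variable names are hypothetical. Because the text rounds intermediate products (for example .3318 for $r_{13} - r_{12} r_{23}$), a machine calculation at full precision may differ from the printed values by a unit or so in the last place.

```python
from math import sqrt

# Means, standard deviations, and zero-order correlations of the example.
M1, M2, M3 = 78.0, 87.2, 32.8
s1, s2, s3 = 10.21, 6.02, 10.35
r12, r13, r23 = 0.666, 0.750, 0.628

# Partial regression coefficients, formula (143).
b12_3 = (s1 / s2) * (r12 - r13 * r23) / (1 - r23 ** 2)
b13_2 = (s1 / s3) * (r13 - r12 * r23) / (1 - r23 ** 2)

# Constant term, formula (146).
C = M1 - b12_3 * M2 - b13_2 * M3

# Probable error of estimate from formula (147).
S123 = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
pe_est = 0.6745 * s1 * sqrt(S123 / (1 - r23 ** 2))

print(f"b12.3 = {b12_3:.4f}, b13.2 = {b13_2:.4f}")
print(f"C = {C:.2f}, P.E. of estimate = {pe_est:.2f}")
```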

It  should  be  noted  that  the  coefficients  .546  and  .540  may  not 
be  compared  directly,  but  that  each  gives  the  average  change 
in  Xi  for  a  unit  change  in  Xo  and  X3,  respectively,  when  the 
other  variable  is  held  constant.  Thus  an  increase  of  1  per 
cent  in  the  high-school  record  is  accompanied  on  the  average 
by  an  increase  of  .546  of  1  per  cent  in  the  college  record,  while 
an  increase  of  one  point  on  the  Brown  test  is  accompanied  by 
an  increase  of  .540  of  1  per  cent  in  college  standing. 

In making a prediction with the above equation it is only
necessary to substitute values for $X_2$ and $X_3$. A student, for
example, may enter the University with a high-school average
of 80 and a Brown test score of 40. Upon substituting these
values in this last equation the most probable standing of the
student in college at the end of the freshman year will be given
by 77.93 ± 4.22. It is therefore an even chance that his college
rating will be anywhere from 73.71 to 82.15, and the importance
of the probable error of estimate is seen in placing a reservation
upon the accuracy of the prediction. For a second student with
a high-school average of 90 and a Brown score of 50 we find,
similarly, $X_1$ = 88.79 ± 4.22. Here it is an even chance that this
student's college average will be between 84.57 and 93.01.
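The arithmetic of this three-variable problem may be checked by machine. The following is only a sketch in Python, using the figures quoted in the text; the names `predict`, `c`, and `pe` are illustrative and are not part of the original treatment.

```python
import math

# Figures quoted in the worked problem above.
B12_3, B13_2 = 0.5461, 0.5405      # partial regression coefficients
M1, M2, M3 = 78.0, 87.2, 32.8      # means of X1, X2, X3
SIGMA1 = 10.21                     # standard deviation of X1
S123 = 0.226932                    # 1 - r12^2 - r13^2 - r23^2 + 2 r12 r13 r23
ONE_MINUS_R23_SQ = 0.605616        # 1 - r23^2

# Constant term: C = M1 - b12.3 * M2 - b13.2 * M3
c = M1 - B12_3 * M2 - B13_2 * M3

# Probable error of estimate: .6745 * sigma1 * sqrt(S123 / (1 - r23^2))
pe = 0.6745 * SIGMA1 * math.sqrt(S123 / ONE_MINUS_R23_SQ)


def predict(x2, x3):
    """Estimate X1 from the rounded regression equation given in the text."""
    return 0.546 * x2 + 0.540 * x3 + 12.65


print(f"C = {c:.2f}, probable error of estimate = {pe:.2f}")
for hs_average, brown_score in [(80, 40), (90, 50)]:
    estimate = predict(hs_average, brown_score)
    print(f"X2={hs_average}, X3={brown_score}: "
          f"even chance between {estimate - pe:.2f} and {estimate + pe:.2f}")
```

The direct computation avoids the logarithms but should agree with them to the number of places carried in the text.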

The  question  sometimes  arises,  Why  predict  the  college  stand- 
ing of  students  when  it  is  already  known  in  this  problem  ?  The 
standing  of  only  the  sample  observed  is  known,  however.  This 
criterion  is  used  as  a  basis  for  determining  the  regression  equa- 
tion by  means  of  which  predictions  may  be  made  with  similar 
groups  for  which  the  college  standing  is  unknown.  It  is  assumed, 
therefore,  that  other  groups  will  possess  the  same  characteristics 
as  the  sample  studied  so  that  the  equation  may  also  be  applied 
to  them.  Needless  to  say,  this  assumption  is  often  not  fulfilled, 
but  the  forecast  by  means  of  the  regression  equation  is  one  of 
the  best  that  can  be  made  on  the  basis  of  past  experience. 
4. Some Cautions in the Use of Regression Equations

Estimates  by  means  of  regression  equations  may  fail  to  be 
reliable  for  several  reasons : 

a.  The  trend  of  the  data  in  the  observed  sample  may  be  im- 
perfectly represented  by  a  linear  function.  By  testing  all  of  the 
zero-order  regressions  for  linearity  this  objection  may  be  over- 
come. 

b.  Data  to  which  the  regression  equation  is  applied  may  not 
be  comparable  with  those  of  the  sample  from  which  the  equa- 
tion was  derived.  This  difficulty  may  be  illustrated  by  the  case 
of  a  high  school  with  unusually  low  or  high  standards  of  mark- 
ing. The  use  of  the  equation  in  the  last  problem  in  such  a  case 
would  give  misleading  results,  since  the  data  were  obtained  from 
a  normal  group.  It  would  probably  be  necessary  to  work  out  a 
separate  equation  for  such  schools. 

c.  The  correlations  of  various  orders  may  be  so  small  that  the 
probable  error  of  estimate  becomes  relatively  large.  If  this  con- 
dition prevails  predictors  having  higher  correlation  with  the 
criterion  must  be  sought,  or  their  number  must  be  increased,  as 
is  evident  from  inspection  of  formula  (141). 

d. The number of cases in the sample furnishing the predicting
equation may be so small that the regression coefficients are
unstable. The probable error of $b_{12.3} = .546$ in the last
problem is given by formula (97) of Chapter XIII, that is,

\[
P.E._{b_{12.3}} = .6745\,\frac{\sigma_{1.23}}{\sigma_{2.3}\sqrt{N}}
               = .6745\,\frac{\sigma_1\sqrt{S_{123}}}{\sigma_2\,(1 - r_{23}^2)\sqrt{N}}
               = .104,
\]

or $b_{12.3} = .546 \pm .104$.

The  value  of  the  regression  coefficients  based  on  a  very  large 
number  of  cases  might  therefore  differ  considerably  from  the 
values  actually  found  in  a  small  sample  of  75,  which  was  used 
here  chiefly  for  numerical  illustration. 
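The dependence of this probable error on the number of cases may be seen from a short computation. The sketch below assumes the quantities quoted earlier for the three-variable problem ($\sigma_1 = 10.21$, $\sigma_2 = 6.02$, $S_{123} = .226932$, $1 - r_{23}^2 = .605616$, $N = 75$); the function name is hypothetical.

```python
import math


def probable_error_b12_3(sigma1, sigma2, s123, one_minus_r23_sq, n):
    """P.E. of b12.3 = .6745 * sigma_1.23 / (sigma_2.3 * sqrt(N))."""
    sigma_1_23 = sigma1 * math.sqrt(s123 / one_minus_r23_sq)
    sigma_2_3 = sigma2 * math.sqrt(one_minus_r23_sq)
    return 0.6745 * sigma_1_23 / (sigma_2_3 * math.sqrt(n))


pe_75 = probable_error_b12_3(10.21, 6.02, 0.226932, 0.605616, 75)
# Quadrupling the sample halves the probable error, other things being equal.
pe_300 = probable_error_b12_3(10.21, 6.02, 0.226932, 0.605616, 300)

print(f"N = 75:  b12.3 = .546 +/- {pe_75:.3f}")
print(f"N = 300: b12.3 = .546 +/- {pe_300:.3f}")
```

Since the probable error varies inversely as the square root of $N$, a sample four times as large would be needed merely to halve it.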
[Fig. 69: teacher-preparation index plotted against time, with the fitted regression line y = .006x]


The  above  difficulties  may  be  illustrated  by  an  example  taken 
from  a  study  by  Dr.  W.  R.  Burgess.*  The  predicted  variable 
was  years  of  teacher  training  beyond  high  school,  regression 
equations  for  which  were  formed  with  time  as  the  predicting 
factor.  In  Fig.  69  the  data  and  regression  line  for  one  state 
are  shown.  By  means  of  the  latter,  Dr.  Burgess  predicted  that 
in  1950  the  average  teacher  in  Montana  would  have  1.36  years 
of  training  beyond  high  school. 

It should be observed, however, that the data do not furnish
a linear trend for the period studied, and the regression line
is therefore a bad fit. Furthermore, the forecast has been
obtained from a ten-year period and has been projected forty
years beyond the range of observation. The assumption that the
educational conditions in Montana from 1920 to 1950 will be
comparable with those from 1910 to 1920 is unwarranted and the
prediction is therefore probably worthless.

The  correlation  between  teacher-preparation  index  and  time 
is  necessarily  small,  thus  giving  a  relatively  large  error  in  esti- 
mate, and  finally,  although  the  total  number  of  observations  is 
large,  the  probable  error  of  a  regression  coefficient  as  small  as 
.006  is  such  as  to  render  its  value  of  doubtful  significance. 

If  predictions  of  the  above  type  are  to  be  made,  the  trend  for 
the  data  studied  must  be  approximately  linear  and  the  projec- 
tion made  only  a  short  time  beyond  the  range  of  observation. 

* W. R. Burgess, "Trends of Teacher Preparation," Journal of Educational
Research, October, 1921, p. 181.

Fig. 69. Regression line for Burgess data
5. Partial Regression Equations for Four Variables

When four variables are involved, the regression equation
for predicting $X_1$ from $X_2$, $X_3$, and $X_4$ may be obtained from
formula (139) and written in the form

\[
X_1 = r_{12.34}\,\frac{\sigma_{1.34}}{\sigma_{2.34}}\,X_2
    + r_{13.24}\,\frac{\sigma_{1.24}}{\sigma_{3.24}}\,X_3
    + r_{14.23}\,\frac{\sigma_{1.23}}{\sigma_{4.23}}\,X_4 + C.
\]

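The standard deviations are obtained from (141), giving

\[
\begin{aligned}
X_1 ={}& r_{12.34}\,
  \frac{\sigma_1\sqrt{1-r_{13}^2}\,\sqrt{1-r_{14.3}^2}}
       {\sigma_2\sqrt{1-r_{23}^2}\,\sqrt{1-r_{24.3}^2}}\,X_2 \\
 &+ r_{13.24}\,
  \frac{\sigma_1\sqrt{1-r_{12}^2}\,\sqrt{1-r_{14.2}^2}}
       {\sigma_3\sqrt{1-r_{23}^2}\,\sqrt{1-r_{34.2}^2}}\,X_3 \\
 &+ r_{14.23}\,
  \frac{\sigma_1\sqrt{1-r_{12}^2}\,\sqrt{1-r_{13.2}^2}}
       {\sigma_4\sqrt{1-r_{24}^2}\,\sqrt{1-r_{34.2}^2}}\,X_4 + C.
 \qquad (150)
\end{aligned}
\]
{Regression equation in four variables in terms of partial correlations}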

In order to calculate the regression coefficients for equation
(150) it is first necessary to compute the required partial-correlation
coefficients of first-order and second-order and then substitute
in the above expression for the regression coefficients.
The value of the constant term C is then readily determined
from formula (146).

This procedure may be the easiest and most direct, especially
when the partial correlations are needed for other purposes.
Another method will next be presented, however, because students
often find it very convenient. The new formulas have
two advantages over (150): all the operations involved are fully
expressed, and nothing but zero-order correlation coefficients
and standard deviations are required in the calculation. It is
therefore only necessary to make straightforward substitutions
of these values. Since the formulas for the correlation and
regression coefficients involve the same expressions, we may begin
with the former.
Returning first to the general formula (135) for partial
correlation, we may write

\[
r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}
                 {\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}.
\]

Upon substituting the values for the coefficients of first-order
in this expression, we find

\[
r_{12.34}
 = \frac{r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}
        {\sqrt{(1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2\,r_{13}r_{14}r_{34})
               (1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34})}}
 = \frac{S_{12.34}}{\sqrt{S_{134}S_{234}}}
 \qquad (151a)
\]

and, similarly,

\[
r_{13.24}
 = \frac{r_{13}(1 - r_{24}^2) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}
        {\sqrt{(1 - r_{12}^2 - r_{14}^2 - r_{24}^2 + 2\,r_{12}r_{14}r_{24})
               (1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34})}}
 = \frac{S_{13.24}}{\sqrt{S_{124}S_{234}}}
 \qquad (151b)
\]

and

\[
r_{14.23}
 = \frac{r_{14}(1 - r_{23}^2) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}
        {\sqrt{(1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23})
               (1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34})}}
 = \frac{S_{14.23}}{\sqrt{S_{123}S_{234}}}
 \qquad (151c)
\]
{Second-order correlation coefficients in terms of zero-order coefficients}

where $S_{12.34} = r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})$
and $S_{134} = 1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2\,r_{13}r_{14}r_{34}$, etc., as used in
formula (147). It will be noted that seven different expressions of
the form indicated by $S_{12.34}$ and $S_{134}$ are required for the
computation of the three correlation coefficients.

Similar expressions for the partial regression coefficients are
next obtained by the use of a general reduction formula,

\[
b_{12.34\cdots n}
 = \frac{r_{12.34\cdots(n-1)} - r_{1n.34\cdots(n-1)}\,r_{2n.34\cdots(n-1)}}
        {1 - r_{2n.34\cdots(n-1)}^2}
   \cdot \frac{\sigma_{1.34\cdots(n-1)}}{\sigma_{2.34\cdots(n-1)}}
 \qquad (152)
\]
{Reduction formula for regression coefficient}

Applying this formula and making use of (141), we find

\[
b_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{1 - r_{24.3}^2}
            \cdot \frac{\sigma_1\sqrt{1 - r_{13}^2}}{\sigma_2\sqrt{1 - r_{23}^2}},
\]
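and upon substituting the values for first-order correlations,
there results

\[
b_{12.34} = \frac{\sigma_1}{\sigma_2}\cdot
  \frac{r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}
       {1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}
 = \frac{\sigma_1}{\sigma_2}\,\frac{S_{12.34}}{S_{234}}
 \qquad (153a)
\]

and, similarly,

\[
b_{13.24} = \frac{\sigma_1}{\sigma_3}\cdot
  \frac{r_{13}(1 - r_{24}^2) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}
       {1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}
 = \frac{\sigma_1}{\sigma_3}\,\frac{S_{13.24}}{S_{234}}
 \qquad (153b)
\]

and

\[
b_{14.23} = \frac{\sigma_1}{\sigma_4}\cdot
  \frac{r_{14}(1 - r_{23}^2) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}
       {1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}
 = \frac{\sigma_1}{\sigma_4}\,\frac{S_{14.23}}{S_{234}}
 \qquad (153c)
\]
{Second-order regression coefficients in terms of zero-order coefficients}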

The advantage of these last equations becomes apparent from
the fact that only four different quantities, of the forms $S_{12.34}$
and $S_{234}$, are required for the complete solution of a given
regression equation. The constant term C is of course given by

\[
C = M_1 - b_{12.34}M_2 - b_{13.24}M_3 - b_{14.23}M_4.
\]

The standard error of estimate may be written

\[
\sigma_{1.234} = \sigma_1\sqrt{(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)}
\]

by the use of equation (141). Upon substituting the value for
$r_{14.23}$ from (151c) and expressing $r_{13.2}$ in terms of zero-order
coefficients, there results, after simplification,

\[
\sigma_{1.234} = \sigma_1\sqrt{\frac{S_{123}S_{234} - S_{14.23}^2}{(1 - r_{23}^2)\,S_{234}}}
\qquad (154)
\]
{Standard deviation of third-order in terms of zero-order coefficients}

and by permuting the subscripts,

\[
\sigma_{2.134} = \sigma_2\sqrt{\frac{S_{123}S_{134} - S_{24.13}^2}{(1 - r_{13}^2)\,S_{134}}},
\quad\text{etc.}
\]
The complete solution of the regression equation in four variables,
together with the partial correlation coefficients, is thus
accomplished by calculating seven quantities, of the forms $S_{12.34}$
and $S_{134}$, based upon zero-order values.

As  an  illustrative  example  we  may  take  a  four-variable 
problem  worked  out  by  Mr.  J.  W.  Hoge  in  a  term  paper.  The 
problem  was  to  predict  success  in  plane  geometry  from  alge- 
braic ability,  arithmetical  ability,  and  intelligence.  The  group 
consisted  of  fifty  high-school  sophomores. 

A list of the variables used may be given as follows:

$X_1$ = criterion = cumulative score on eight units of work in
plane geometry covering a six months' period.

$X_2$ = the cumulative score on three algebra tests covering
the four fundamental operations and the solution of linear and
quadratic equations.

$X_3$ = score on the Reavis-Breslich arithmetic test.

$X_4$ = intelligence quotient on the Otis Self-Administering test.

The  zero-order  correlation  coefficients,  standard  deviations, 
and  means  are  given  in  the  following  table : 

Table 80.  Data from Mr. Hoge's Paper

Zero-Order      Standard       Means          Partial Correlations
Correlations    Deviations                    for Subsequent Use
------------    -----------    -----------    --------------------
r12 = .54       σ1 = 35.5      M1 = 224.4
r13 = .49       σ2 = 6.87      M2 = 41.32     r13.2  = .258
r14 = .41       σ3 = 21.28     M3 = 81.52     r14.23 = .234
r23 = .58       σ4 = 8.49      M4 = 113.88
r24 = .29
r34 = .50

Returning to equations (153), we shall first work out the
regression coefficients for the equation

\[
X_1 = b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 + C.
\]

Upon substituting the zero-order correlation coefficients we find

\[
S_{12.34} = .192,\quad S_{13.24} = .0779,\quad S_{14.23} = .1095,
\quad\text{and}\quad S_{234} = .498.
\]



The values for the regression coefficients may then be written

\[
b_{12.34} = \frac{35.5}{6.87}\times\frac{.192}{.498} = 1.99,
\qquad
b_{13.24} = \frac{35.5}{21.28}\times\frac{.0779}{.498} = .261,
\]

and

\[
b_{14.23} = \frac{35.5}{8.49}\times\frac{.1095}{.498} = .919.
\]

The equation for the constant term C then gives

\[
C = 224.4 - 1.99 \times 41.32 - .261 \times 81.52 - .919 \times 113.88 = 16.2,
\]

and the complete regression equation is thus written,

\[
X_1 = 1.99\,X_2 + .261\,X_3 + .919\,X_4 + 16.2.
\]

In order to obtain the standard error of estimate $\sigma_{1.234}$, the
quantity $S_{123}$ is required, and the latter will be worked out in
computing the partial correlation as follows:

\[
S_{123} = .439,\qquad S_{124} = .585,\qquad S_{134} = .543.
\]

Substituting the necessary values in equations (151), we find

\[
r_{12.34} = \frac{.192}{\sqrt{.543 \times .498}} = .369,
\qquad
r_{13.24} = \frac{.0779}{\sqrt{.585 \times .498}} = .144,
\]

and

\[
r_{14.23} = \frac{.1095}{\sqrt{.439 \times .498}} = .234.
\]

The value for $\sigma_{1.234}$ from formula (154) becomes

\[
\sigma_{1.234} = 35.5\sqrt{\frac{.218622 - .011990}{.6636 \times .498}} = 28.1,
\]

so that the probable error of estimate is .6745 × 28.1 = 19.0.

This large error of estimate is due to the wide range (134 to 276)
and  large  standard  deviation  (35.5)  of  the  cumulative  geometry 
scores  as  well  as  to  the  low  intercorrelation  of  the  tests. 
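The whole four-variable solution may be carried through by machine from the zero-order values alone. The sketch below is in Python and assumes only the figures of Table 80; the helper names `rho`, `s3`, and `s4` are illustrative, standing for the zero-order coefficient and for quantities of the forms $S_{123}$ and $S_{12.34}$. Because the code carries more decimal places than the text, the last figures differ slightly (for instance, C comes out near 16.1 rather than 16.2).

```python
import math

# Zero-order data from Table 80 (variable 1 is the criterion).
r = {(1, 2): .54, (1, 3): .49, (1, 4): .41,
     (2, 3): .58, (2, 4): .29, (3, 4): .50}
sigma = {1: 35.5, 2: 6.87, 3: 21.28, 4: 8.49}
mean = {1: 224.4, 2: 41.32, 3: 81.52, 4: 113.88}


def rho(i, j):
    """Zero-order correlation, symmetric in its subscripts."""
    return r[(min(i, j), max(i, j))]


def s3(i, j, k):
    """S_ijk = 1 - r_ij^2 - r_ik^2 - r_jk^2 + 2 r_ij r_ik r_jk."""
    return (1 - rho(i, j) ** 2 - rho(i, k) ** 2 - rho(j, k) ** 2
            + 2 * rho(i, j) * rho(i, k) * rho(j, k))


def s4(i, j, k, l):
    """S_ij.kl, the numerator shared by r_ij.kl and b_ij.kl."""
    return (rho(i, j) * (1 - rho(k, l) ** 2)
            - rho(i, k) * rho(j, k) - rho(i, l) * rho(j, l)
            + rho(k, l) * (rho(i, k) * rho(j, l) + rho(i, l) * rho(j, k)))


s234 = s3(2, 3, 4)
b, partial = {}, {}
for j in (2, 3, 4):
    k, l = [v for v in (2, 3, 4) if v != j]
    numerator = s4(1, j, k, l)
    b[j] = sigma[1] / sigma[j] * numerator / s234               # formulas (153)
    partial[j] = numerator / math.sqrt(s3(1, k, l) * s234)      # formulas (151)

c = mean[1] - sum(b[j] * mean[j] for j in (2, 3, 4))

# Formula (154): standard error of estimate, then the probable error.
sigma_1_234 = sigma[1] * math.sqrt(
    (s3(1, 2, 3) * s234 - s4(1, 4, 2, 3) ** 2) / ((1 - rho(2, 3) ** 2) * s234))
pe = 0.6745 * sigma_1_234

print(f"X1 = {b[2]:.2f} X2 + {b[3]:.3f} X3 + {b[4]:.3f} X4 + {c:.1f} +/- {pe:.1f}")
```

Only the seven quantities named in the text are computed; no first-order partial correlations are needed at any step.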

By  dropping  intelligence  as  a  predictor  the  regression  equa- 
tion becomes  Xi  =  1.99  X2  +  .444  X3  +  106.0  ±  19.5,  with  only 
a  slightly  larger  error  of  estimate.  The  estimate  from  three 
variables  is  thus  more  reliable  than  the  estimate  from  two 
variables,  but  on  account  of  the  small  difference  it  is  hardly 
worth  while  using  more  than  the  two  predicting  variables  in 
such a problem. Correlations of the order .5 between criterion
and predictor are usually necessary before additional variables
increase the reliability of the estimate to any appreciable extent.
In order to assist the student in working out any regression
coefficients by the above method with four variables, a complete
set of values is given in Table 81.

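Table 81.  Regression Coefficients of Second-Order Expressed
in Terms of Zero-Order Coefficients

\[
\begin{aligned}
b_{12.34} &= \frac{\sigma_1}{\sigma_2}\left[\frac{r_{12}(1-r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1}{\sigma_2}\,\frac{S_{12.34}}{S_{234}} \\[4pt]
b_{13.24} &= \frac{\sigma_1}{\sigma_3}\left[\frac{r_{13}(1-r_{24}^2) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1}{\sigma_3}\,\frac{S_{13.24}}{S_{234}} \\[4pt]
b_{14.23} &= \frac{\sigma_1}{\sigma_4}\left[\frac{r_{14}(1-r_{23}^2) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2\,r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1}{\sigma_4}\,\frac{S_{14.23}}{S_{234}} \\[4pt]
b_{21.34} &= \frac{\sigma_2}{\sigma_1}\left[\frac{r_{12}(1-r_{34}^2) - r_{23}r_{13} - r_{24}r_{14} + r_{34}(r_{23}r_{14} + r_{24}r_{13})}{1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2\,r_{13}r_{14}r_{34}}\right] = \frac{\sigma_2}{\sigma_1}\,\frac{S_{12.34}}{S_{134}} \\[4pt]
b_{23.14} &= \frac{\sigma_2}{\sigma_3}\left[\frac{r_{23}(1-r_{14}^2) - r_{12}r_{13} - r_{24}r_{34} + r_{14}(r_{12}r_{34} + r_{24}r_{13})}{1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2\,r_{13}r_{14}r_{34}}\right] = \frac{\sigma_2}{\sigma_3}\,\frac{S_{23.14}}{S_{134}} \\[4pt]
b_{24.13} &= \frac{\sigma_2}{\sigma_4}\left[\frac{r_{24}(1-r_{13}^2) - r_{12}r_{14} - r_{23}r_{34} + r_{13}(r_{12}r_{34} + r_{23}r_{14})}{1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2\,r_{13}r_{14}r_{34}}\right] = \frac{\sigma_2}{\sigma_4}\,\frac{S_{24.13}}{S_{134}} \\[4pt]
b_{31.24} &= \frac{\sigma_3}{\sigma_1}\left[\frac{r_{13}(1-r_{24}^2) - r_{23}r_{12} - r_{34}r_{14} + r_{24}(r_{23}r_{14} + r_{34}r_{12})}{1 - r_{12}^2 - r_{14}^2 - r_{24}^2 + 2\,r_{12}r_{14}r_{24}}\right] = \frac{\sigma_3}{\sigma_1}\,\frac{S_{13.24}}{S_{124}} \\[4pt]
b_{32.14} &= \frac{\sigma_3}{\sigma_2}\left[\frac{r_{23}(1-r_{14}^2) - r_{13}r_{12} - r_{34}r_{24} + r_{14}(r_{13}r_{24} + r_{34}r_{12})}{1 - r_{12}^2 - r_{14}^2 - r_{24}^2 + 2\,r_{12}r_{14}r_{24}}\right] = \frac{\sigma_3}{\sigma_2}\,\frac{S_{23.14}}{S_{124}} \\[4pt]
b_{34.12} &= \frac{\sigma_3}{\sigma_4}\left[\frac{r_{34}(1-r_{12}^2) - r_{13}r_{14} - r_{23}r_{24} + r_{12}(r_{13}r_{24} + r_{23}r_{14})}{1 - r_{12}^2 - r_{14}^2 - r_{24}^2 + 2\,r_{12}r_{14}r_{24}}\right] = \frac{\sigma_3}{\sigma_4}\,\frac{S_{34.12}}{S_{124}} \\[4pt]
b_{41.23} &= \frac{\sigma_4}{\sigma_1}\left[\frac{r_{14}(1-r_{23}^2) - r_{24}r_{12} - r_{34}r_{13} + r_{23}(r_{24}r_{13} + r_{34}r_{12})}{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23}}\right] = \frac{\sigma_4}{\sigma_1}\,\frac{S_{14.23}}{S_{123}} \\[4pt]
b_{42.13} &= \frac{\sigma_4}{\sigma_2}\left[\frac{r_{24}(1-r_{13}^2) - r_{14}r_{12} - r_{34}r_{23} + r_{13}(r_{14}r_{23} + r_{34}r_{12})}{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23}}\right] = \frac{\sigma_4}{\sigma_2}\,\frac{S_{24.13}}{S_{123}} \\[4pt]
b_{43.12} &= \frac{\sigma_4}{\sigma_3}\left[\frac{r_{34}(1-r_{12}^2) - r_{14}r_{13} - r_{24}r_{23} + r_{12}(r_{14}r_{23} + r_{24}r_{13})}{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23}}\right] = \frac{\sigma_4}{\sigma_3}\,\frac{S_{34.12}}{S_{123}}
\end{aligned}
\]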
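All twelve entries of Table 81 follow one pattern, $b_{ij.kl} = (\sigma_i/\sigma_j)\,S_{ij.kl}/S_{jkl}$, so that a single routine will serve for any of them. The following Python sketch uses the Table 80 figures purely for illustration; the function names are hypothetical. It also checks the familiar relation $b_{12.34}\,b_{21.34} = r_{12.34}^2$.

```python
import math

# Zero-order correlations and standard deviations from Table 80.
R = {(1, 2): .54, (1, 3): .49, (1, 4): .41,
     (2, 3): .58, (2, 4): .29, (3, 4): .50}
SIGMA = {1: 35.5, 2: 6.87, 3: 21.28, 4: 8.49}


def r(i, j):
    """Zero-order correlation, symmetric in its subscripts."""
    return R[(min(i, j), max(i, j))]


def s_pair(i, j, k, l):
    """S_ij.kl = r_ij(1 - r_kl^2) - r_ik r_jk - r_il r_jl + r_kl(r_ik r_jl + r_il r_jk)."""
    return (r(i, j) * (1 - r(k, l) ** 2) - r(i, k) * r(j, k) - r(i, l) * r(j, l)
            + r(k, l) * (r(i, k) * r(j, l) + r(i, l) * r(j, k)))


def s_triple(j, k, l):
    """S_jkl = 1 - r_jk^2 - r_jl^2 - r_kl^2 + 2 r_jk r_jl r_kl."""
    return (1 - r(j, k) ** 2 - r(j, l) ** 2 - r(k, l) ** 2
            + 2 * r(j, k) * r(j, l) * r(k, l))


def b(i, j):
    """b_ij.kl: regression of variable i on variable j, the other two held constant."""
    k, l = [v for v in (1, 2, 3, 4) if v not in (i, j)]
    return SIGMA[i] / SIGMA[j] * s_pair(i, j, k, l) / s_triple(j, k, l)


def partial_r(i, j):
    """r_ij.kl = S_ij.kl / sqrt(S_ikl * S_jkl)."""
    k, l = [v for v in (1, 2, 3, 4) if v not in (i, j)]
    return s_pair(i, j, k, l) / math.sqrt(s_triple(i, k, l) * s_triple(j, k, l))


for i in (1, 2, 3, 4):
    for j in (1, 2, 3, 4):
        if i != j:
            print(f"b{i}{j} (others constant) = {b(i, j):.4f}")
```

The denominator in each case is the quantity $S$ formed from the three variables other than the one being predicted, which is why each row-group of the table shares a denominator.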


Another  example  of  four-variable  regression  is  furnished  by 
Burt's  data  in  section  2.  The  equation  for  estimating  the  Binet 
score  from  the  remaining  scores  is  given  by  Burt*  in  the  form 
Binet  =  .54  school  work  +  .33  intelligence  (reasoning)  +  .11  age, 
the  variables  being  taken  from  the  mean  of  the  whole  set  and 

* Op. cit. p. 183.
all expressed in age units. A year's increase in educational age
is therefore accompanied on the average by .54 of a year of
increase in mental age, and a year's increase in intelligence by
.33 of a year in mental age, etc. Since all the variables are in
the same unit and the total of the coefficients happens to be
almost unity, Burt makes the following interpretation: "Of the
gross result, then, one ninth is attributable to age, one third to
intellectual development and over half to school attainment,
... or in determining the child's performance on the Binet-Simon
scale, intelligence can bestow but little more than half
the share of school, and age but one third the share of
intelligence" (op. cit. p. 183).

These  results  have  been  seized  upon  by  the  antagonists  of 
intelligence  tests  as  showing  that  the  Binet  scale  measures 
chiefly  school  work  and  not  intelligence,  as  already  noted  in 
section  2;  but  the  difficulties  involved  in  such  interpretation 
become  apparent  when  other  equations  such  as  that  for  pre- 
dicting age  are  given.  Thus,  age  =  .15  Binet  +  .51  school 
work  +  .03  intelligence.*  Are  we  to  conclude  from  this  result 
that over half a child's age is "attributable" to school work,
one  sixth  to  Binet,  and  only  a  small  fraction  to  intelligence? 
Such  a  conclusion  is  absurd,  but  it  is  logically  as  sound  as 
Burt's  inference  regarding  his  equation. 

Regression  coefficients  of  any  order  merely  show  the  average 
change  in  the  dependent  variable  for  a  unit  change  in  the  inde- 
pendent variable  to  which  they  are  attached,  the  remaining 
variables  being  constant.  If  these  coefficients  are  obtained  for 
a  set  of  variables  all  in  the  same  units  the  relative  value  of 
the  several  predictors  may  be  compared  as  they  affect  the  esti- 
mate. Thus  "school  work"  is  five  thirds  as  valuable  as  "intelli- 
gence" in  forecasting  mental  age,  and  five  times  as  valuable  as 

* For the derivation of this equation and other critical comment see Holzinger
and Freeman, "The Interpretation of Burt's Regression Equation," Journal of
Educational Psychology, December, 1925. For further discussion see also G. H. Thomson,
"The Interpretation of Burt's Regression Equation," and Holzinger and Freeman,
"Rejoinder," Journal of Educational Psychology, May and September, 1926.
chronological age used for the same purpose. This is quite
a different interpretation, however, from Burt's, in which
regression coefficients are regarded as representing the parts of
the independent variables going to make up the total of the
dependent variable.

It  must  be  kept  in  mind  that  ordinarily  only  a  few  of  the 
possible  predicting  variables  are  used  in  an  estimate ;  also,  that 
the  regression  coefficients  will  change  whenever  a  new  variable 
is  added,  provided  the  partial  correlations  involved  are  not 
zero.  This  may  be  illustrated  by  beginning  with  the  equation 
for  Binet  scores  on  school  work,  which  is  approximately  of  the 
form :  Binet  =  .9  school  work.  According  to  Burt's  reasoning, 
nine tenths of Binet will then be "attributed to school work."
The  addition  of  new  variables,  however,  will  reduce  this  share 
to  .7,  then  to  .54,  and  much  lower  if  enough  predictors  are 
taken  which  have  some  partial  correlation  with  Binet. 

6.  Multiple  Correlation 

The correlation between a set of observed values such as
given by $X_1$ and the predicted values from the regression
equation

\[
\bar{X}_1 = b_{12.34\cdots n}X_2 + b_{13.24\cdots n}X_3 + \cdots + b_{1n.23\cdots(n-1)}X_n + C_1
\]

is known as multiple correlation and is denoted by $R_{1(234\cdots n)}$.
It may be shown* that

\[
R_{1(23\cdots n)} = r_{X_1\bar{X}_1} = \sqrt{1 - \frac{\sigma_{1.23\cdots n}^2}{\sigma_1^2}},
\qquad (155)
\]
{Multiple-correlation coefficient}

but a more convenient form for calculation is given by

\[
1 - R_{1(23\cdots n)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)\cdots(1 - r_{1n.23\cdots(n-1)}^2),
\qquad (156)
\]
{Computation form for R}

which follows at once from equation (141).

*  Yule,  op.  cit.  p.  248. 
One use of multiple correlation is in showing how closely $X_1$
can be expressed as a linear function of $X_2, X_3, \cdots, X_n$. If $X_1$
coincides with the predicted $\bar{X}_1$ for all the observations, the
standard error of estimate becomes zero and $R_{1(23\cdots n)}$ by
formula (155) will equal unity. If, on the other hand, the
residuals $X_1 - \bar{X}_1$ are so large that their standard deviation
$\sigma_{1.23\cdots n}$ approaches $\sigma_1$, the value of $R_{1(23\cdots n)}$ will approach
zero. The multiple-correlation coefficient thus gives an alternative
method for determining the reliability of an estimate
from a regression equation.

In order to illustrate the use of these formulas, we may return
to the data given in section 5 for predicting success in geometry.
Considering that $X_1$ is to be estimated from $X_2$ and $X_3$,
we may substitute $r_{12} = .54$ and $r_{13.2} = .258$ in equation (156),
giving

\[
1 - R_{1(23)}^2 = [1 - (.54)^2]\,[1 - (.258)^2].
\]

The calculation is very easily done with the aid of Holzinger's
Table VI, thus:

\[
\begin{aligned}
\log\,[1 - (.54)^2]      &= 9.85028 \\
\log\,[1 - (.258)^2]     &= 9.97008 \\
\log\,[1 - R_{1(23)}^2]  &= 9.82036
\end{aligned}
\]

\[
\therefore\ R_{1(23)} = .582.
\]

It  is  only  necessary  to  add  the  logarithms  and  look  up  the  value 
for  R  corresponding  to  their  sum  ;  for  example,  for  the  logarithm 
9.82038  in  Table  VI  we  obtain  R  =  .582,  the  answer  being  correct 
to  three  places. 

In estimating $X_1$ from $X_2$, $X_3$, and $X_4$ the equation will be
$1 - R_{1(234)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)$. The necessary
arithmetic is therefore

\[
\begin{aligned}
\log\,[1 - (.54)^2]       &= 9.85028 \\
\log\,[1 - (.258)^2]      &= 9.97008 \\
\log\,[1 - (.234)^2]      &= 9.97554 \\
\log\,[1 - R_{1(234)}^2]  &= 9.79590
\end{aligned}
\]

\[
\therefore\ R_{1(234)} = .612.
\]
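The same results may be had without logarithms. The sketch below, in Python, applies formula (156) to the partial coefficients of Table 80 and, as a check, formula (155) to the values $\sigma_{1.234} = 28.1$ and $\sigma_1 = 35.5$ found in section 5; the function name `multiple_r` is illustrative only.

```python
import math


def multiple_r(partials):
    """Formula (156): 1 - R^2 is the product of (1 - r^2) over r12, r13.2, r14.23, ..."""
    remainder = math.prod(1 - p ** 2 for p in partials)
    return math.sqrt(1 - remainder)


r_two = multiple_r([.54, .258])            # X1 estimated from X2 and X3
r_three = multiple_r([.54, .258, .234])    # X1 estimated from X2, X3, and X4

# Formula (155), using the rounded standard error of estimate from section 5.
r_check = math.sqrt(1 - (28.1 / 35.5) ** 2)

print(f"R1(23)  = {r_two:.3f}")
print(f"R1(234) = {r_three:.3f}  (from formula (155): {r_check:.3f})")
```

The small disagreement in the third place between the two routes to $R_{1(234)}$ arises only from the rounding of $\sigma_{1.234}$ to 28.1.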
When $\sigma_{1.234}$ has already been computed by formula (154),
it is of course necessary only to substitute this result in
formula (155).

The regression coefficients are the best possible weights which
can be assigned to the variables $X_2, X_3, \cdots, X_n$ in making a
linear prediction for $X_1$. The multiple-correlation coefficient,
therefore, gives a useful measure of the correlation which can
be expected from pooling the predictive tests in the form of
a regression equation. Thus in the above example the coefficients
.582 and .612 measure the reliability of estimates from
pooling two and three predictors in the best linear form.
The gain in reliability is very slight, however, when a third
variable is added, a conclusion which was reached also by
comparing the probable errors of estimate, 19.5 and 19.0.

An  interesting  application  of  the  method  of  multiple  correla- 
tion is  given  in  the  volume  on  Psychological  Testing  in  the 
United  States  Army,*  where  the  possibility  of  increasing  the 
correlation  between  the  Beta  scale  and  the  Stanford-Binet 
test  is  determined.  The  necessary  zero-order  coefficients  are 
given  in  the  following  table. 


Table 82.  Correlations of Beta Tests with Stanford-Binet
Mental Age and with Each Other (653 Cases)

                                     Beta Tests
Test                  1     2     3     4     5     6     7     8
----------------    ----  ----  ----  ----  ----  ----  ----  ----
Stanford-Binet      .465  .545  .614  .639  .622  .586  .610  .572

Beta Tests
1. Maze                   .477  .522  .514  .457  .490  .510  .476
2. Cube                         .632  .576  .560  .556  .592  .551
3. X-O series                         .689  .670  .584  .597  .619
4. Digit symbol                             .766  .654  .584  .695
5. Number check                                   .619  .521  .703
6. Picture                                              .555  .569
7. Geometrical                                                .559
8. Spot pattern

* Memoirs of the National Academy of Sciences, Vol. XV (1921), p. 387.
Upon applying the multiple-correlation formula (155) it was
found that $R_{S(12345678)} = .731$, which is the highest correlation
obtainable between Stanford-Binet and the best linear
weighting of the eight Beta tests. The correlation between the
unweighted pool of these eight tests and the Binet test was .728,
showing that very slight improvement is made by weighting
such components.

The  writers  of  the  above  report  then  decided  to  eliminate 
certain  tests  as  suggested  by  the  results  from  the  partial  cor- 
relations and  thus  obtain  a  shorter  and  possibly  as  good  a  test 
with  unweighted  items  as  with  the  whole  battery  weighted  or 
unweighted.  By  empirical  trial  they  found :  (1)  elimination  of 
test  8,  r  (Stanford  X  Beta)  =  .726;  (2)  elimination  of  tests  8 
and  2,  r (Stanford  X  Beta)  =  .723 ;  (3)  elimination  of  tests  8, 
2,  and  1,  r (Stanford  X  Beta)  =  .723.  Thus  the  simple  pool  of 
five  of  the  Beta  tests  gave  almost  as  good  results  as  the  best 
weighting  of  all  eight. 

The final form suggested was to use a non-weighted pool of
six of the Beta tests, dropping test 8 and giving test 1 one half
the weight of the rest. The correlation for this last result with
the Binet scale was .727, which is only slightly less than the
best value, .731.

Some important properties of multiple correlation may next
be shown by returning to equation (156). It is apparent that
every parenthesis on the right is smaller than unity, provided
none of the partial correlations be equal to zero. Hence

\[
\begin{aligned}
1 - R_{1(23\cdots n)}^2 &< 1 - r_{12}^2, \\
1 - R_{1(23\cdots n)}^2 &< 1 - r_{13.2}^2, \\
\text{and}\quad 1 - R_{1(23\cdots n)}^2 &< 1 - r_{14.23}^2,\ \text{etc.} \\
\text{Similarly,}\quad 1 - R_{1(32\cdots n)}^2 &< 1 - r_{13}^2, \\
\text{and}\quad 1 - R_{1(42\cdots n)}^2 &< 1 - r_{14}^2,\ \text{etc.}
\end{aligned}
\]

The multiple-correlation coefficient R cannot, therefore, be
smaller than any partial coefficient of zero or of a higher order,
and it is usually considerably larger.
If the coefficients $r_{12}, r_{13}, \cdots, r_{1n}$ are all equal and are denoted
by $r_{1x}$, and if the coefficients $r_{23}, r_{24}, \cdots, r_{(n-1)n}$ are also equal
and are denoted by $r_{xx}$, it also follows from (155) that

\[
R_{1(23\cdots n)} = r_{1x}\sqrt{\frac{n-1}{1 + (n-2)\,r_{xx}}}.
\qquad (157)
\]
{Multiple-correlation coefficient for equal coefficients}

In case C is to be predicted from n other variables, n − 1 may
be replaced by n in formula (157), giving

\[
R_{c(123\cdots n)} = r_{cx}\sqrt{\frac{n}{1 + (n-1)\,r_{xx}}},
\qquad (158)
\]

which is the same result as that obtained in equation (51) of
Chapter IX.

If the numerator and denominator under the radical of equation
(158) be divided by n, and then n be allowed to approach
infinity, we find that

\[
R_{c(123\cdots\infty)} = \frac{r_{cx}}{\sqrt{r_{xx}}}.
\qquad (159)
\]
{Limiting value for (158) when n → ∞}

These last two equations are useful in estimating the limits
for prediction. Suppose, for example, that there are 50 unrelated
environmental conditions, each correlated to the extent of
.05 with human physical traits ($r_{xx} = 0$, and $r_{cx} = .05$). Upon
substituting in (158), we find $R = .05\sqrt{50} = .35$. In actual
practice, however, there is a correlation of about .5 between such
environmental conditions, so that by using (159) we find $R = .07$;
that is, an infinity of such conditions increase the correlation
from .05 to only .07.

The best results are of course obtained by seeking predictors
which correlate high with the criterion and low amongst
themselves. Thus, if $r_{cx} = .6$, $r_{xx} = .4$, and $n = 10$, we find, from
equation (158), that $R = .88$. Arbitrary values may be substituted
in this formula, giving a result greater than unity; but,
from the constitution of the whole set of variables, this cannot
occur in actual practice.
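The three numerical statements just made are easily verified. The Python sketch below evaluates formulas (158) and (159) for the values given in the text; the function names are illustrative, and the final line uses arbitrary values (not taken from any data) merely to show how an impossible result greater than unity can arise on paper.

```python
import math


def r_equal(r_cx, r_xx, n):
    """Formula (158): R = r_cx * sqrt(n / (1 + (n - 1) * r_xx))."""
    return r_cx * math.sqrt(n / (1 + (n - 1) * r_xx))


def r_limit(r_cx, r_xx):
    """Formula (159): the limit of (158) as n grows without bound."""
    return r_cx / math.sqrt(r_xx)


unrelated = r_equal(.05, 0.0, 50)     # fifty unrelated conditions
limit = r_limit(.05, .5)              # an infinity of conditions correlated .5
good_battery = r_equal(.6, .4, 10)    # ten predictors, high validity, low overlap
impossible = r_equal(.9, .1, 10)      # arbitrary values, exceeding unity

print(f"{unrelated:.2f}  {limit:.2f}  {good_battery:.2f}  {impossible:.2f}")
```

A very large n substituted in `r_equal` with $r_{xx} = .5$ should approach the value returned by `r_limit`, which offers a further check on (159).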
7. Solution by Determinants

While  the  methods  of  calculation  shown  thus  far  are  probably 
as  convenient  as  any  up  to  four  variables,  another  procedure  will 
next  be  given  in  which  determinants  are  employed.  The  student 
who  is  familiar  with  the  theory  of  determinants  and  who  has  the 
use  of  a  calculating  machine  may  find  this  method  fairly  rapid. 

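The chief function used is the determinant of all the zero-order
correlations given by

\[
\Delta =
\begin{vmatrix}
r_{11} & r_{21} & r_{31} & \cdots & r_{n1} \\
r_{12} & r_{22} & r_{32} & \cdots & r_{n2} \\
r_{13} & r_{23} & r_{33} & \cdots & r_{n3} \\
\vdots & \vdots & \vdots &        & \vdots \\
r_{1n} & r_{2n} & r_{3n} & \cdots & r_{nn}
\end{vmatrix}
\qquad (160)
\]
{Determinant of zero-order coefficients}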

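A minor such as $\Delta_{12}$ is obtained by striking out all the
coefficients in the row and column common to $r_{12}$. A cofactor, $A_{12}$, is
equal to the minor $\Delta_{12}$ with the sign that would be attached in
expanding the determinant. Thus the three-rowed determinant

\[
\Delta =
\begin{vmatrix}
r_{11} & r_{21} & r_{31} \\
r_{12} & r_{22} & r_{32} \\
r_{13} & r_{23} & r_{33}
\end{vmatrix}
\qquad (161)
\]
{Determinant for three variables}

may be written $\Delta = r_{11}\Delta_{11} - r_{12}\Delta_{12} + r_{13}\Delta_{13}$
or

\[
\begin{aligned}
\Delta &= r_{11}A_{11} + r_{12}A_{12} + r_{13}A_{13} \\
 &= r_{11}\begin{vmatrix} r_{22} & r_{32} \\ r_{23} & r_{33} \end{vmatrix}
  + r_{12}(-1)\begin{vmatrix} r_{21} & r_{31} \\ r_{23} & r_{33} \end{vmatrix}
  + r_{13}\begin{vmatrix} r_{21} & r_{31} \\ r_{22} & r_{32} \end{vmatrix} \\
 &= r_{11}(r_{22}r_{33} - r_{23}^2) + r_{12}(r_{13}r_{23} - r_{12}r_{33})
  + r_{13}(r_{12}r_{23} - r_{13}r_{22}).
\end{aligned}
\]

Simplifying this last expression, we find that

\[
\Delta = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23},
\qquad (162)
\]
{Expanded form of (161)}

which, of course, is the same as $S_{123}$ of section 5.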
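With a similar notation Professor Pearson* has shown that

$$r_{12.34 \ldots n} = \frac{-A_{12}}{\sqrt{A_{11}A_{22}}} \quad (163)$$

$$r_{1k.34 \ldots n} = \frac{-A_{1k}}{\sqrt{A_{11}A_{kk}}} \quad (164)$$

$$R_{1(23 \ldots n)} = \sqrt{1 - \frac{\Delta}{A_{11}}} \quad (165)$$

$$\sigma_{1.23 \ldots n} = \sigma_1\sqrt{1 - R_{1(23 \ldots n)}^2} = \sigma_1\sqrt{\frac{\Delta}{A_{11}}} \quad (166)$$

$$b_{12.34 \ldots n} = \frac{-A_{12}}{A_{11}} \cdot \frac{\sigma_1}{\sigma_2} \quad (167)$$

$$b_{1k.23 \ldots n} = \frac{-A_{1k}}{A_{11}} \cdot \frac{\sigma_1}{\sigma_k} \quad (168)$$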

These formulas will next be illustrated by the problem in predicting
geometrical success. Arranging the zero-order coefficients
from Table 80 in the form of a determinant, we have

$$
\Delta = \begin{vmatrix}
1   & .54 & .49 & .41 \\
.54 & 1   & .58 & .29 \\
.49 & .58 & 1   & .50 \\
.41 & .29 & .50 & 1
\end{vmatrix}
$$

This may be worked out by reducing to a determinant of lower
order.

Multiplying each row by the reciprocal of the item in the
first column, we have

Column 1      Reciprocal of Column 1
  1             1.000
  0.54          1.852
  0.49          2.041
  0.41          2.439

$$
\Delta = \begin{vmatrix}
1 & 0.54  & 0.49  & 0.41  \\
1 & 1.852 & 1.074 & 0.537 \\
1 & 1.184 & 2.041 & 1.020 \\
1 & 0.707 & 1.220 & 2.439
\end{vmatrix}
\times 1.000 \times .54 \times .49 \times .41.
$$

(If all the elements of a row (or column) are multiplied by the
same number n, the determinant is multiplied by n.)


*  Unpublished  lecture  notes. 
Next, subtract the elements of the first row from those of each
of the other three rows (this leaves the value of $\Delta$ unchanged).

$$
\Delta = \begin{vmatrix}
1 & 0.54  & 0.49  & 0.41  \\
0 & 1.312 & 0.584 & 0.127 \\
0 & 0.644 & 1.551 & 0.610 \\
0 & 0.167 & 0.730 & 2.029
\end{vmatrix} \times .1085
= \begin{vmatrix}
1.312 & 0.584 & 0.127 \\
0.644 & 1.551 & 0.610 \\
0.167 & 0.730 & 2.029
\end{vmatrix} \times .1085
$$

$$
\begin{aligned}
= .1085[\,&1.312(1.551 \times 2.029 - .73 \times .61) \\
 -\; &.644(.584 \times 2.029 - .127 \times .73) \\
 +\; &.167(.584 \times .61 - .127 \times 1.551)] = .3112.
\end{aligned}
$$

The  determinant  can  of  course  be  reduced  to  two  rows  before 
expanding,  but  the  arithmetic  from  the  three-rowed  value  above 
is  very  rapid  on  a  machine. 

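The other determinants required may be worked out in a
similar way, that is,

$$
\Delta_{11} = \begin{vmatrix}
1   & .58 & .29 \\
.58 & 1   & .50 \\
.29 & .50 & 1
\end{vmatrix} = +.498,
\qquad
\Delta_{12} = \begin{vmatrix}
.54 & .49 & .41 \\
.58 & 1   & .50 \\
.29 & .50 & 1
\end{vmatrix} = +.192.
$$

Also, $\Delta_{22} = +.543$, $\Delta_{33} = +.585$, $\Delta_{44} = +.439$, $\Delta_{13} = -.0779$,
and $\Delta_{14} = +.1095$, so that $A_{12} = -.192$, $A_{13} = -.0779$, and
$A_{14} = -.1095$.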
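Substituting these values in formulas (163) to (168), we find

$$r_{12.34} = \frac{+.192}{\sqrt{.498 \times .543}} = .369,$$

$$r_{13.24} = \frac{+.0779}{\sqrt{.498 \times .585}} = .144,$$

$$r_{14.23} = \frac{+.1095}{\sqrt{.498 \times .439}} = .234,$$

$$R_{1(234)} = \sqrt{1 - \frac{.3112}{.498}} = .612,$$

and

$$\sigma_{1.234} = 35.5\sqrt{\frac{.3112}{.498}} = 28.1.$$

$$\text{P.E.}_{1.234} = .6745\,\sigma_{1.234} = 19.0.$$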
Also,

$$b_{12.34} = \frac{+.192}{.498} \cdot \frac{35.5}{\sigma_2} = 1.99,$$

$$b_{13.24} = \frac{+.0779}{.498} \cdot \frac{35.5}{21.28} = .261,$$

and

$$b_{14.23} = \frac{+.1095}{.498} \cdot \frac{35.5}{8.49} = .919.$$

It should be noted that the above results agree with those
found in section 5, since $\Delta_{12} = S_{12.34}$, $\Delta_{13} = -S_{13.24}$, $\Delta_{14} = S_{14.23}$,
$\Delta_{11} = S_{234}$, $\Delta_{22} = S_{134}$, $\Delta_{33} = S_{124}$, and $\Delta_{44} = S_{123}$.
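For readers with a computer in place of a calculating machine, the determinant method can be sketched in a few lines of Python. The sketch below expands determinants by cofactors along the first row, as in (161), and then applies (163) and (165) to the four-variable problem. The helper names `minor`, `det`, and `cofactor` are illustrative choices, and the results agree with the hand computation only to about three figures, since the text rounds at each step.

```python
import math

def minor(m, i, j):
    """Strike out row i and column j of a square matrix (list of lists)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def cofactor(m, i, j):
    """Minor with the sign attached in expanding the determinant."""
    return (-1) ** (i + j) * det(minor(m, i, j))

# Zero-order correlations of the worked example (indices 0..3 stand for 1..4).
corr = [
    [1.00, 0.54, 0.49, 0.41],
    [0.54, 1.00, 0.58, 0.29],
    [0.49, 0.58, 1.00, 0.50],
    [0.41, 0.29, 0.50, 1.00],
]

delta = det(corr)
A = [[cofactor(corr, i, j) for j in range(4)] for i in range(4)]

def partial_r(k):
    """Formula (164): correlation of variable 1 with variable k+1, others held constant."""
    return -A[0][k] / math.sqrt(A[0][0] * A[k][k])

multiple_R = math.sqrt(1 - delta / A[0][0])   # formula (165)

print(round(delta, 4), round(A[0][0], 3))
print([round(partial_r(k), 3) for k in (1, 2, 3)])
print(round(multiple_R, 3))
```

Recursive cofactor expansion is chosen here because it mirrors the text; for many variables a library routine based on elimination would be far faster.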

When more than four variables are involved, it is probably
best to use reduction formulas of the type

$$r_{12.345} = \frac{r_{12.34} - r_{15.34}\,r_{25.34}}{\sqrt{[1 - r_{15.34}^2][1 - r_{25.34}^2]}} \qquad \{\text{Partial correlation of third order}\} \quad (169)$$

and

$$b_{12.345} = \frac{r_{12.34} - r_{15.34}\,r_{25.34}}{(1 - r_{25.34}^2)} \cdot \frac{\sigma_1\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{14.3}^2}}{\sigma_2\sqrt{1 - r_{23}^2}\,\sqrt{1 - r_{24.3}^2}}, \qquad \{\text{Regression coefficient of third order}\} \quad (170)$$

and carry out the arithmetic on a calculating machine with the
aid of Miner's Tables. The computation is not only easier than
by determinants, but the checks $r_{12.34} = r_{12.43}$, etc., already
noted can be conveniently made. A good example of a correlation
problem in five variables is given in Pearl's "Medical
Statistics and Biometry," p. 329, while other methods of
calculation may be found in Kelley's "Statistical Method,"
chap. xi.
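The reduction formulas lend themselves to a short recursive routine. The Python sketch below is an illustration only: the function name `partial_r` and the dictionary layout are assumptions of this example. It removes one control variable at a time in the manner of (169), and is tried on the data of Exercise 1 below, including the check $r_{12.34} = r_{12.43}$.

```python
import math

def partial_r(r, i, j, controls=()):
    """Partial correlation of i and j with the variables in `controls` held
    constant, computed by repeated use of the reduction formula."""
    if not controls:
        return r[frozenset((i, j))]
    *rest, k = controls            # remove the last control variable, k
    rest = tuple(rest)
    r_ij = partial_r(r, i, j, rest)
    r_ik = partial_r(r, i, k, rest)
    r_jk = partial_r(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Zero-order correlations from Exercise 1 (age, weight, standing and sitting height).
r = {
    frozenset((1, 2)): 0.75, frozenset((1, 3)): 0.85, frozenset((1, 4)): 0.79,
    frozenset((2, 3)): 0.89, frozenset((2, 4)): 0.90, frozenset((3, 4)): 0.94,
}

r12_34 = partial_r(r, 1, 2, (3, 4))
r12_43 = partial_r(r, 1, 2, (4, 3))   # same quantity, controls removed in the other order
r13_24 = partial_r(r, 1, 3, (2, 4))

print(round(r12_34, 3), round(r12_43, 3), round(r13_24, 2))
```

The two orders of elimination give the same value apart from floating-point error, which is the check recommended in the text.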

EXERCISES 

1. Data: 113 pupils (67 boys and 46 girls). Variables, (1) age,
(2) weight, (3) standing height, (4) sitting height. Correlations,
$r_{12} = .75$, $r_{13} = .85$, $r_{14} = .79$, $r_{23} = .89$, $r_{24} = .90$, $r_{34} = .94$.

Work out the partial correlations of the second order.
($r_{12.34} = -.007$, $r_{13.24} = .50$, $r_{14.23} = -.04$,
$r_{23.14} = .26$, $r_{24.13} = .41$, $r_{34.12} = .63$. Ans.)
2. Calculate the first-order and second-order partial-correlation
coefficients from the following data:
$r_{12} = .78$, $r_{13} = .45$, $r_{14} = .40$, $r_{23} = .48$, $r_{24} = .29$, $r_{34} = .52$.
ri2.3  =  .720l„        _  rro       r23.i=      .2311  „        _       ^, 


20  \^         _  ..        r23.i=      .231  \^ 

^r^   >ri2.34  —  'iO  .^„    ^^23.14 

D(  J  r23.4  =      .403  j 


>Ans. 


^12.4  =  .tO(  J 

ri3.2  =  .1381^        _   ..        r24.i  =  -  .038 \ 
ri3.4  =  .309r^^-^^--^^       r24.3=      .054 1 

ri4.2  =  .291\^  _    r,n         r34.i=        .415) 

ri4.3  =  .218j  r34.2  =      .4o4j  j 

3. Given: $r_{12} = -.481$, $r_{13} = -.697$, $r_{14} = -.494$,
$r_{23} = +.374$, $r_{24} = +.363$, $r_{34} = +.286$,
$\sigma_1 = 34.48$, $\sigma_2 = 2.89$, $\sigma_3 = 2.58$, $\sigma_4 = 2.79$,
$M_1 = 99.94$, $M_2 = 73.54$, $M_3 = 78.23$, $M_4 = 77.39$.

Verify the following results:

$X_1 = 1093 - 2.09X_2 - 7.40X_3 - 3.36X_4$,

$\sigma_{1.234} = 21.69$, $R_{1(234)} = .778$.

4. The following regression equation was obtained by F. L.
Whitney (Journal of Educational Research, May, 1923):

$X_1 = 23.218 + .004X_2 - .038X_3 - .115X_4 + .915X_5 + 1.403X_6 - .085X_7 \pm 3.02$.

Predict the teaching success of a student with the following records:

$X_1$ (to be predicted) = score on a rating scale,
$X_2$ = 80 = intelligence score,
$X_3$ = 89.4 = high-school academic record,
$X_4$ = 8.7 = normal-school academic record,
$X_5$ = 8.5 = normal-school professional record,
$X_6$ = 8.6 = student-teaching record,
$X_7$ = 9.0 = measure of physique.

($X_1 = 38.2 \pm 3.02$. Ans.)

Interpret the regression coefficients. Do good academic work and
good physique interfere with good teaching?


5.  Derive  the  formula  /?i(23)  =  -x   ~ ~ "., 

6.  Derive  formulas  (151a),  (151b),  and  (151c). 

7.  Derive  formulas  (157),  (158),  and  (159). 


CHAPTER  XVI 

THE  ELEMENTS  OF  CURVE-FITTING 

1. Introductory 

The investigator in many fields of science is frequently interested
in determining the mathematical curve underlying his data.
Such a curve is not only desirable in furnishing the theoretical
law to which the observations conform, but is also of practical
value as a basis for estimation. In the fields of education and
psychology examples are furnished by learning curves, physical
and mental growth curves, and frequency distributions. It is
important to know the general laws of mental growth as well
as to predict the standing of individuals of a given group, and
for such purposes it is usually necessary to fit the data with a
curve whose constants depend upon the observations. The
plot of the experimental data often suggests some mathematical
function which will be a good approximation to the observed
material, allowing for the minor fluctuations in sampling. The
problem is then to select the type of curve which is to be fitted
to the data and to obtain its equation by appropriate methods.
The suitability of the curve selected may finally be determined
by tests for goodness of fit.

The choice of the proper sort of mathematical function will
depend a great deal upon the worker's experience in curve-fitting
and the accuracy of fit required. It is a well-known
fact that by putting as many constants into the equation as
there are observations the resulting curve will pass through all
the observed points. If this is done, however, an extremely
complicated function will result and the minor fluctuations,
which should be smoothed out, will be given undue emphasis.
It is therefore better to use a simple function involving only a
few constants, securing in this way a smoothing or graduation
of the data which allows for the small fluctuations of sampling.
In the present chapter we shall introduce several of these
simple curves and show how they may be fitted to the observed
data. The observations to be fitted may consist of a series of
points resulting from two measured characters such as the
amount learned in a given time, or they may be given in the
form of a frequency distribution. Three types of curves will be
presented for fitting data of the first sort, while the normal
probability curve will be used to illustrate the method of graduating
frequency distributions. It should be noted that these
curves have been selected from a very large number available
because they have been found to give good results with certain
data. They are presented here chiefly for illustration of the
methods of fitting.

2.  Types  of  Curves 

In dealing with growth and learning data one of the most
useful functions is the hyperbola, which for the purpose of curve-fitting
may be most conveniently written in the form

$$Y = \frac{X}{a + bX} + c. \qquad \{\text{Hyperbola}\} \quad (171)$$

The constants a, b, and c are to be determined from the observations.
The use of this curve will be illustrated in applying
the method of averages in section 5.

Another curve which has been found to give a good approximation
to growth data is the logarithmic growth function,

$$Y = a + bX + c \log X. \qquad \{\text{Logarithmic growth function}\} \quad (172)$$

This curve is similar in appearance to (171) and will be shown
to give approximately as good results with certain data. The
introduction of the terms $a + bX$ has the effect of raising and
stretching out horizontally the ordinary logarithmic curve,
$Y = c \log X$.
A third and very useful function is the nth-order parabola,

$$Y = C_0 + C_1X + C_2X^2 + C_3X^3 + \cdots + C_nX^n, \qquad \{n\text{th-order parabola}\} \quad (173)$$

where the C's are constants determined by the data. If
$C_2 = C_3 = \cdots = C_n = 0$, this expression reduces to the equation of
a straight line; if $C_3 = C_4 = \cdots = C_n = 0$, an ordinary parabola
results; while if $C_4 = C_5 = \cdots = C_n = 0$, a cubic is obtained, etc.
In the case of regression curves from correlation tables, a very
good fit is often obtained by the use of the nth-order parabola,
but the question of how many terms to include must frequently
be decided by trial and error.

A full discussion of frequency curves is beyond the scope of
the present text. We shall therefore confine our illustration
in the last section to the normal curve $y = y_0 e^{-x^2/2\sigma^2}$, which is
already familiar to the reader.
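The three curve types (171), (172), and (173) can be written as small Python functions, which is convenient when ordinates are wanted for plotting. This is a sketch only: the function names are illustrative, and the logarithm is taken to base 10, an assumption made here because the tables of logarithms used later in the chapter are common logarithms.

```python
import math

def hyperbola(x, a, b, c):
    """Equation (171): Y = X / (a + bX) + c."""
    return x / (a + b * x) + c

def log_growth(x, a, b, c):
    """Equation (172): Y = a + bX + c log X, with common (base-10) logarithms."""
    return a + b * x + c * math.log10(x)

def parabola(x, coeffs):
    """Equation (173): Y = C0 + C1 X + ... + Cn X^n, with coeffs = [C0, C1, ..., Cn].
    Evaluated by Horner's rule, working from the highest power down."""
    y = 0.0
    for c in reversed(coeffs):
        y = y * x + c
    return y

# A straight line, an ordinary parabola, and a cubic are special cases of (173).
print(parabola(2, [3, 0.5]))        # straight line 3 + .5X
print(parabola(2, [3, 0, 2]))       # ordinary parabola 3 + 2X^2
print(parabola(2, [1, 1, 1, 1]))    # cubic 1 + X + X^2 + X^3
```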

3. Methods of Curve-Fitting

The first step in anticipation of curve-fitting is to plot the
observed data so as to note the trend of the points and to determine,
if possible, the appropriate curve to use. Having chosen
some simple form such as described above, it is next necessary
to determine the approximate values of the constants appearing
in the equation. The methods used for such determination
will depend upon the degree of accuracy required in the
fit. If only a rough idea of the trend is required, a free-hand
curve drawn through the observed points may be sufficient. For
more accurate results, however, it will be necessary to apply
certain mathematical methods known as averages, least squares,
or moments. The first three of these methods will next be
described, while the method of moments will be treated in
sections 8 and 9 in dealing with frequency data.
Free-hand Method

A free-hand curve drawn through the observed points is
clearly the easiest and simplest method to employ, but as already
noted it may give results which are quite inaccurate.
Several workers, moreover, would not agree closely upon the
same free-hand graduation.

In drawing a curve through a series of points the fitting is
often facilitated by the use of curved pieces of celluloid (French
curves). These may be moved about as the curve is drawn so
that the largest possible number of observed points lie on the
curve or deviate equally on either side.

It sometimes happens that the most elaborate mathematical
methods fail to give a good fit with certain data over a part of
the range. In such cases it may be desirable to resort to free-hand
approximations, possibly in combination with the other
methods.*

Method  of  Averages 

A second and more accurate method of curve-fitting is the
method of averages. If $Y$ represents an observed ordinate and
$\hat{Y}$ denotes the corresponding ordinate on the fitted curve, the
vertical deviations $Y - \hat{Y}$ are known as residuals (see Chapter IX).
It is assumed in the method of averages that the "best" fit is
that which makes the algebraic sum of the residuals equal
to zero.

In the case of a straight line $\hat{Y} = a + bX$ the above condition
requires that

$$\Sigma(Y - \hat{Y}) = \Sigma(Y - a - bX) = 0,$$

or

$$\Sigma Y - Na - b\Sigma X = 0. \quad (174)$$

By dividing data into two parts, two equations of this type may
be formed and solved for the constants a and b.

* For an example of this sort see an article by the author, "On the Relation of
Vital Capacity to Certain Psychical Characters," Biometrika, Vol. XVI, p. 140.
While the method of averages may be used with functions
involving several constants, it will be found most convenient
when applied to the straight line where only two constants are
required. This method will be illustrated in detail in section 5.

Method  of  Least  Squares 

The third and one of the best methods of curve-fitting is
known as the method of least squares. This procedure, it will
be recalled, has already been used in Chapter IX in obtaining
the equations of the regression lines. Further illustration will
now be given for the nth-order parabola.

If $\hat{Y}$ represents a value on such a parabola and $Y$ an observation,
the problem is to find values of the constants $C_0, C_1, C_2, \ldots, C_n$
such that the sum of the squares of the residuals $(Y - \hat{Y})$
is as small as possible, that is, to have

$$u = \Sigma(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)^2 = \text{a minimum}.$$

This is accomplished by equating to zero the partial derivatives*
of $u$ with respect to $C_0, C_1, C_2, \ldots, C_n$ and thereby obtaining
$n + 1$ equations for the solution of the $n + 1$ constants.
Differentiating in this way and setting the obtained results
equal to zero, we find

$$\frac{\partial u}{\partial C_0} = 2\Sigma(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-1) = 0,$$

$$\frac{\partial u}{\partial C_1} = 2\Sigma(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X) = 0,$$

$$\frac{\partial u}{\partial C_2} = 2\Sigma(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X^2) = 0,$$

$$\cdots$$

$$\frac{\partial u}{\partial C_n} = 2\Sigma(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X^n) = 0;$$

or, rearranging,

* See any good calculus, such as W. A. Granville, Differential and Integral Calculus.
Ginn and Company.
$$\Sigma Y = C_0\Sigma(1) + C_1\Sigma(X) + C_2\Sigma(X^2) + \cdots + C_n\Sigma(X^n), \quad (175\text{a})$$

$$\Sigma XY = C_0\Sigma(X) + C_1\Sigma(X^2) + C_2\Sigma(X^3) + \cdots + C_n\Sigma(X^{n+1}), \quad (175\text{b})$$

$$\Sigma X^2Y = C_0\Sigma(X^2) + C_1\Sigma(X^3) + C_2\Sigma(X^4) + \cdots + C_n\Sigma(X^{n+2}), \quad (175\text{c})$$

$$\cdots$$

$$\Sigma X^nY = C_0\Sigma(X^n) + C_1\Sigma(X^{n+1}) + C_2\Sigma(X^{n+2}) + \cdots + C_n\Sigma(X^{2n}), \quad (175\text{d})$$

{Normal equations, unweighted ordinates}

where $\Sigma(1)$ = the number of ordinates summed.

These last expressions are known as the normal equations, of
which there are clearly $n + 1$ in number. The variable $Y$ is given
by the observed ordinates taken from a convenient origin, while
$X$ may also be measured as the deviation $\ldots, -3, -2, -1,
0, 1, 2, 3, \ldots$ from any arbitrary point. The quantities $\Sigma Y$,
$\Sigma XY$, $\Sigma X^2Y$, etc., are found in the manner illustrated in section 7,
and the $n + 1$ resulting linear equations are then solved
for $C_0, C_1, C_2, \ldots, C_n$.

In the above case it has been assumed that the ordinates $Y$
are to be given equal weight. If these are obtained from the
means of arrays, however, the frequencies $f_x$ may need to be
taken into account. It is then necessary to make the sum

$$\Sigma f_x(Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)^2 = \text{a minimum},$$

giving rise to the following normal equations:

$$\Sigma f_xY = C_0\Sigma(f_x) + C_1\Sigma(f_xX) + C_2\Sigma(f_xX^2) + \cdots + C_n\Sigma(f_xX^n), \quad (176\text{a})$$

$$\Sigma f_xXY = C_0\Sigma(f_xX) + C_1\Sigma(f_xX^2) + C_2\Sigma(f_xX^3) + \cdots + C_n\Sigma(f_xX^{n+1}), \quad (176\text{b})$$

$$\cdots$$

$$\Sigma f_xX^nY = C_0\Sigma(f_xX^n) + C_1\Sigma(f_xX^{n+1}) + C_2\Sigma(f_xX^{n+2}) + \cdots + C_n\Sigma(f_xX^{2n}). \quad (176\text{c})$$

{Normal equations, weighted ordinates}

To distinguish these two methods the former is said to be
based on unweighted, and the latter on weighted, ordinates. Both
methods will be fully illustrated in section 7.
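The formation and solution of the unweighted normal equations (175) can be sketched in Python as follows. The sketch is illustrative rather than a reproduction of the hand method of section 7: the function names are hypothetical, the linear equations are solved by ordinary elimination with row interchanges, and the trial data are artificial ordinates lying exactly on $Y = 3 + 2X^2$, with $X$ measured as deviations from the middle point.

```python
def normal_equations(xs, ys, n):
    """Build the n + 1 normal equations (175) for an nth-order parabola."""
    power_sums = [sum(x ** k for x in xs) for k in range(2 * n + 1)]          # Sigma(X^k)
    rhs = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(n + 1)]   # Sigma(X^k Y)
    matrix = [[power_sums[i + j] for j in range(n + 1)] for i in range(n + 1)]
    return matrix, rhs

def solve(matrix, rhs):
    """Solve linear equations by elimination with partial pivoting."""
    m = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * m
    for r in range(m - 1, -1, -1):                      # back substitution
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, m))
        solution[r] = (aug[r][m] - tail) / aug[r][r]
    return solution

xs = [-3, -2, -1, 0, 1, 2, 3]            # deviations from an arbitrary origin
ys = [3 + 2 * x ** 2 for x in xs]        # artificial ordinates on a known parabola
matrix, rhs = normal_equations(xs, ys, 2)
c0, c1, c2 = solve(matrix, rhs)
print(round(c0, 6), round(c1, 6), round(c2, 6))
```

For weighted ordinates the same routine would serve if every sum were multiplied through by the frequency $f_x$, as in (176).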
In comparing the fit of two or more curves to a given body
of data, a good test is furnished by finding the squared or mean
squared sum of the residuals, that is,

$$\Sigma(Y - \hat{Y})^2 \quad \text{or} \quad \frac{\Sigma(Y - \hat{Y})^2}{N}.$$

4. Illustration of the Free-hand Method

As an example of the free-hand method, we have selected a
series of seven observations given at the right of Fig. 70. The
curve was so drawn as to let the points deviate about equally
on either side.

   Y      X
   5      1
  10      2
  22      3
  35      4
  55      5
  73      6
 101      7

Fig. 70. A free-hand curve drawn through seven observed points


If an approximation to the equation of such a curve is
desired, it may often be found by rectification. Thus if the
desired equation has the form

$$f(Y) = a + bF(X)^* \quad (177)$$

* The symbols $f(Y)$ and $F(X)$ mean a function of Y and a function of X. See
Chapter III, section 5.
we may rectify this equation by substituting $Y' = f(Y)$ and
$X' = F(X)$. The result,

$$Y' = a + bX', \quad (178)$$

will then be a straight line by means of which the constants
a and b can be determined. The form of the original function
(177) must, of course, be guessed, but if a straight line results
from (178) the choice is justified.

   Y     X^2
   5      1
  10      4
  22      9
  35     16
  55     25
  73     36
 101     49

Fig. 71. Illustrating the method of rectification (vertical axis, Y' = Y; horizontal axis, X' = X^2)

In the above problem it looks as if the desired equation might
be a parabola of the form

$$Y = a + bX^2. \quad (179)$$

Setting $Y' = Y$ and $X' = X^2$, we may then find the plot of
$Y$ and $X^2$ to see if a straight line is obtained.

The graph in Fig. 71 clearly justifies the choice of the parabola,
so that it only remains to obtain the constants a and b.
Since the first and last points appear to fall on the line, we
may obtain approximate values for these quantities by solving
the resulting equations, $5 = a + b$ and $101 = a + 49b$, giving
$a = 3$ and $b = 2$. The equation of the parabola is then

$$Y = 3 + 2X^2.$$

The method of rectification is useful not only in justifying
the form of the function assumed but in furnishing a simple
method of obtaining the necessary constants.
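The rectification of the seven observations can be followed step by step in Python. The short sketch below (variable names are illustrative) forms $X' = X^2$, passes a line through the first and last rectified points as the text does, and lists the residuals left by the resulting parabola.

```python
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [5, 10, 22, 35, 55, 73, 101]

# Rectify: Y' = Y and X' = X^2, so that Y = a + bX^2 becomes a straight line.
x_rect = [x ** 2 for x in xs]

# Line through the first and last rectified points: 5 = a + b and 101 = a + 49b.
b = (ys[-1] - ys[0]) / (x_rect[-1] - x_rect[0])
a = ys[0] - b * x_rect[0]

fitted = [a + b * xr for xr in x_rect]
residuals = [y - f for y, f in zip(ys, fitted)]

print(a, b)          # constants of the parabola
print(residuals)     # observed minus fitted ordinates
```

The residuals are small and change sign irregularly, which is what a satisfactory choice of function should show.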

5. Fitting a Learning Curve with a Hyperbola by the Method of Averages

The data used for the present illustration were interpolated
from a graph* showing the number of words typed in four
minutes, Y, for various numbers of pages written, X. Inspection
of Table 83 and Fig. 73, where the data are plotted, suggests
that a hyperbola might be a good fit, and this is the curve
employed by Thurstone.

Table 83. Data from L. L. Thurstone's Experiment in Typewriting

Total Number of        Words Typed in Four Minutes
Pages Written          (Average of 51 Subjects)
      250                      148
      230                      145
      210                      138
      190                      133
      170                      130
      150                      120
      130                      113
      110                      110
       90                       99
       70                       90
       50                       73
       30                       60
       10                       39

Inasmuch as the curve does not pass through the origin, it
will be necessary to add a constant term to the equation of the
hyperbola through (0, 0), with the result that

$$Y = \frac{X}{a + bX} + c. \quad (171)$$

* L. L. Thurstone, "The Learning Curve Equation," Psychological Review
Monographs, Vol. XXV, No. 3 (1919), p. 45, Fig. 5. (Only the odd ordinates were
used.)


326 


STATISTICAL  METHODS  IN  EDUCATION 


The  constant  c  here  represents  the  skill  in  typewriting  which 
the  students  have  at  the  beginning  of  the  experiment. 

The above hyperbola may be rectified by selecting a point
($X_k$, $Y_k$) on the curve (or one that looks as if it might fall on
the curve) and forming the differences

$$Y - Y_k = \frac{X}{a + bX} - \frac{X_k}{a + bX_k} = \frac{a(X - X_k)}{(a + bX)(a + bX_k)}.$$

The hyperbola passing through the point $X_k$, $Y_k$ is therefore

$$\frac{X - X_k}{Y - Y_k} = (a + bX_k) + \frac{b}{a}(a + bX_k)X. \quad (180)$$

Setting $\frac{X - X_k}{Y - Y_k} = Z$, $a + bX_k = m$, and $\frac{b}{a}(a + bX_k) = n$, we
may also write

$$Z = m + nX, \quad (181)$$

which is linear in X and Z. Thus if the plot of X and Z approximates
a straight line, the original data will be approximated
by the hyperbola (180), and the equation of the latter
may be obtained from (181) by determining m and n.

Table 84. Showing the Calculation Necessary for Rectifying
the Hyperbola Fitted to the Learning-Curve Data

Pages     X     Y = Words in     X - 1     Y - 39     Z = (X - 1)/(Y - 39)
                4 Minutes
 250     13        148             12        109          .1101
 230     12        145             11        106          .1038
 210     11        138             10         99          .1010
 190     10        133              9         94          .0957
 170      9        130              8         91          .0879
 150      8        120              7         81          .0864
Total    63                                               .5849

 130      7        113              6         74          .0811
 110      6        110              5         71          .0704
  90      5         99              4         60          .0667
  70      4         90              3         51          .0588
  50      3         73              2         34          .0588
  30      2         60              1         21          .0476
  10      1         39              0          0
Total    28                                               .3834
The calculation for such rectification is shown in Table 84.
The first point in the series with coordinates $X_k = 1$, $Y_k = 39$ has
been selected for the origin. In Fig. 72 the values $Z = \frac{X - 1}{Y - 39}$
have been plotted with a resulting trend that appears to be
fairly linear. It now remains to find the equation of the line
of best average fit.

By dividing the data into two parts and summing over each,
as shown in Table 84, the two equations like (181) necessary
for the determination of m and n may be written

$$.5849 = 6m + 63n$$

and

$$.3834 = 6m + 27n.$$

It will be noted that $\Sigma X$ is reduced to 27 when only six items
are used. Subtracting the second equation from the first to
eliminate m, we find $n = .00560$ and, by substitution,
$m = .0387$. The required straight line which is shown
in Fig. 72 then has the equation

$$Z = .0387 + .0056X.$$

Fig. 72. Illustrating the method of rectifying the hyperbola for Thurstone's data

The equation for the hyperbola may now be written

$$\frac{X - 1}{Y - 39} = .0387 + .0056X,$$

or

$$Y = \frac{X - 1}{.0387 + .0056X} + 39. \quad (182)$$

A list of values for plotting equation (182) is given in Table 85.
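The arithmetic of the method of averages for these data may be retraced in Python. The sketch below, with illustrative variable names, rectifies Thurstone's figures about the point (1, 39), sums each half of the data, and solves the two resulting equations of type (181) for m and n. Because the ratios are not rounded to four places before summing, the constants agree with the hand values only to about three significant figures.

```python
xs = list(range(13, 0, -1))   # coded abscissas 13, 12, ..., 1
ys = [148, 145, 138, 133, 130, 120, 113, 110, 99, 90, 73, 60, 39]
xk, yk = 1, 39                # point chosen as origin for the rectification

# Z = (X - Xk)/(Y - Yk); the origin itself gives 0/0 and is left out.
points = [(x, (x - xk) / (y - yk)) for x, y in zip(xs, ys) if x != xk]
first, second = points[:6], points[6:]

def group_sums(group):
    """Number of items, Sigma X, and Sigma Z for one half of the data."""
    return len(group), sum(x for x, _ in group), sum(z for _, z in group)

n1, sx1, sz1 = group_sums(first)      # Sigma Z = n1*m + sx1*n
n2, sx2, sz2 = group_sums(second)     # Sigma Z = n2*m + sx2*n

# Solve the two linear equations for m and n by determinants.
d = n1 * sx2 - n2 * sx1
n_const = (n1 * sz2 - n2 * sz1) / d
m_const = (sz1 * sx2 - sz2 * sx1) / d

print(sx1, round(sz1, 4), sx2, round(sz2, 4))
print(round(m_const, 4), round(n_const, 4))
```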
Table 85. Showing the Calculation of the Ordinates for the
Hyperbola and Test for Fit by Squared Differences

($Y$ denotes the observed ordinate of Table 84 and $\hat{Y}$ the ordinate of the fitted hyperbola.)

  X    .0056X    .0056X     (X - 1) /           Fitted =             Y - Fitted     (Y - Fitted)^2
                 + .0387    (.0056X + .0387)    Previous Column      (Table 84
                                                + 39                 for Y)
 13    .0728     .1115          108                 147                   1               1
 12    .0672     .1059          104                 143                   2               4
 11    .0616     .1003          100                 139                  -1               1
 10    .0560     .0947           95                 134                  -1               1
  9    .0504     .0891           90                 129                   1               1
  8    .0448     .0835           84                 123                  -3               9
  7    .0392     .0779           77                 116                  -3               9
  6    .0336     .0723           69                 108                   2               4
  5    .0280     .0667           60                  99                   0               0
  4    .0224     .0611           49                  88                   2               4
  3    .0168     .0555           36                  75                  -2               4
  2    .0112     .0499           20                  59                   1               1
  1    .0056     .0443            0                  39                   0               0

$\Sigma(Y - \hat{Y})^2 = 39$

From Fig. 73, where the hyperbola has been plotted, the fit
appears to be a very good one. A numerical measure of fit
is shown in the above table by the quantity

$$\Sigma(Y - \hat{Y})^2 = 39, \quad \text{or} \quad \frac{\Sigma(Y - \hat{Y})^2}{N} = \frac{39}{13} = 3.$$

This result will be compared later with that obtained in the case of
a logarithmic growth curve. The size of $\Sigma(Y - \hat{Y})^2$ will determine which
curve is the better fit for the same observations. The $\chi^2$
test is not feasible here, owing to difficulties which are beyond
the scope of this text.

Fig. 73. Thurstone's data fitted with a hyperbola by the method of averages (horizontal axis: total pages)
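The ordinates of Table 85 and the test for fit can be recomputed from equation (182) in a few lines of Python. The sketch rounds each fitted ordinate to the nearest whole word, as the table does; the function name `fitted_hyperbola` is an illustrative choice.

```python
xs = list(range(13, 0, -1))   # coded abscissas 13, 12, ..., 1
ys = [148, 145, 138, 133, 130, 120, 113, 110, 99, 90, 73, 60, 39]

def fitted_hyperbola(x):
    """Equation (182): Y = (X - 1)/(.0387 + .0056X) + 39."""
    return (x - 1) / (0.0387 + 0.0056 * x) + 39

fitted = [round(fitted_hyperbola(x)) for x in xs]      # whole words, as in Table 85
differences = [y - f for y, f in zip(ys, fitted)]
sum_sq = sum(d ** 2 for d in differences)

print(fitted)
print(differences)
print(sum_sq, sum_sq / len(ys))   # squared sum and mean squared sum of residuals
```

The same few lines, with another function in place of `fitted_hyperbola`, give the corresponding measure for any competing curve.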
6. Fitting a Learning Curve with the Logarithmic Growth Function
by the Method of Least Squares

The data in the preceding section will next be fitted by the
logarithmic growth curve

$$Y = a + bX + c \log X, \quad (172)$$

using the method of least squares with unweighted ordinates.
The use of weighted ordinates is usually not necessary, and in
the above problem the frequencies are not given.

It is now necessary to find the values of a, b, and c which will
make the quantity

$$v = \Sigma(Y - a - bX - c \log X)^2 = \text{a minimum}.$$

The partial derivatives of $v$ with respect to a, b, and c are next
formed and equated to zero, as on page 321. The desired
normal equations may then be written in the form

$$\Sigma(Y) = a\Sigma(1) + b\Sigma(X) + c\Sigma(\log X), \quad (183\text{a})$$

$$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2) + c\Sigma(X \log X), \quad (183\text{b})$$

$$\Sigma(Y \log X) = a\Sigma(\log X) + b\Sigma(X \log X) + c\Sigma(\log X)^2. \quad (183\text{c})$$

{Normal equations for the logarithmic growth curve}

These may be solved for a, b, and c, giving the constants necessary
for the logarithmic function of least-square fit.

The arithmetic is greatly facilitated by a table for sums such
as $\Sigma(\log X)$, $\Sigma(X \log X)$, and $\Sigma(\log X)^2$, which is given on page
330. For a more extended table of these values the student
should consult Pearl's "Medical Statistics," p. 368.

Upon examining equations (183) it is apparent that the quantities
$\Sigma(X)$, $\Sigma(Y)$, $\Sigma(XY)$, $\Sigma(X^2)$, and $\Sigma(Y \log X)$ need to be
calculated from the data, the remaining sums being obtained
from Table 86. The calculation of these required sums is
shown in full in Table 87, where, it will be noted, a check on
$\Sigma(\log X)$ is obtained.
Table 86. Sums of Log X, X Log X, and (Log X)^2

  X     Sum of Log X     Sum of X Log X     Sum of (Log X)^2
  1        0.00000           0.00000             0.00000
  2        0.30103           0.60206             0.09062
  3        0.77815           2.03342             0.31826
  4        1.38021           4.44166             0.68074
  5        2.07918           7.93651             1.16930
  6        2.85733          12.60542             1.77482
  7        3.70243          18.52111             2.48901
  8        4.60552          25.74583             3.30458
  9        5.55976          34.33401             4.21516
 10        6.55976          44.33401             5.21516
 11        7.60116          55.78933             6.29966
 12        8.68034          68.73950             7.46429
 13        9.79428          83.22077             8.70516
 14       10.94041          99.26656            10.01877
 15       12.11650         116.90793            11.40196
 16       13.32062         136.17385            12.85187
 17       14.55107         157.09148            14.36587
 18       15.80634         179.68639            15.94158
 19       17.08509         203.98270            17.57679
 20       18.38612         230.00330            19.26947
 21       19.70834         257.76991            21.01773
 22       21.05077         287.30321            22.81983
 23       22.41249         318.62295            24.67413
 24       23.79271         351.74802            26.57912
 25       25.19065         386.69652            28.53335

The calculation of $\Sigma(X)$ and $\Sigma(X^2)$ is facilitated by the use
of Pearson's Tables XXVII and XXVIII, which give the sums
and sums of powers of natural numbers.

The normal equations may now be written

(a) 1,398 = 13a + 91b + 9.7943c.

(b) 11,326 = 91a + 819b + 83.2208c.

(c) 1,187.8 = 9.7943a + 83.2208b + 8.7052c.

These may be solved by determinants, but straightforward
elimination is probably as convenient as any method. The
complete solution is given below for the benefit of those students
who have not worked problems of this sort for some time.
Multiplying equation (a) by 7 and subtracting from (b) gives

(d) 1540 = 182b + 14.6607c.
Table 87. Showing the Formation of Sums Necessary for Fitting
a Logarithmic Function by Unweighted Ordinates

Pages     X     Y = Words in       XY       X^2      Log X*       Y Log X
                4 Minutes
 250     13        148           1,924      169      1.11394     164.86312
 230     12        145           1,740      144      1.07918     156.48110
 210     11        138           1,518      121      1.04139     143.71182
 190     10        133           1,330      100      1.00000     133.00000
 170      9        130           1,170       81      0.95424     124.05120
 150      8        120             960       64      0.90309     108.37080
 130      7        113             791       49      0.84510      95.49630
 110      6        110             660       36      0.77815      85.59650
  90      5         99             495       25      0.69897      69.19803
  70      4         90             360       16      0.60206      54.18540
  50      3         73             219        9      0.47712      34.82976
  30      2         60             120        4      0.30103      18.06180
  10      1         39              39        1      0.00000       0.00000
Total    91      1,398          11,326      819      9.79427   1,187.84583

Multiplying (a) by .75341 and subtracting from (c), we find

(e) 134.5 = 14.6605 b + 1.3261 c.

The terms involving b may next be eliminated by multiplying
(e) by 12.4143 and combining with (d), with the result

(f) 129.7 = 1.8019 c, or c = 71.98.

By substitution and check we also obtain a = 34.67 and
b = 2.663.

The required growth curve then has the equation

Y = 34.67 + 2.663 X + 71.98 log X,        (184)

and is plotted in Fig. 74.

From Table 88, where values for plotting are computed, it
will also be noted that the sum of the squared differences,
$\Sigma(Y' - Y)^2$, is 48, which is not much larger than that obtained
for the hyperbola fitted to the same data. In this example,
then, there is little choice between the two curves.
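The elimination and the sum of squared differences may be checked by machine. What follows is only a minimal sketch in Python, assuming the NumPy library is available; the names `A`, `rhs`, `observed`, and `fitted` are illustrative and do not come from the text. A direct solution carries more figures than the hand elimination, so the constants it returns may differ slightly in the last places from those given above.

```python
import math
import numpy as np

# Normal equations (a), (b), (c) for Y = a + b*X + c*log10(X), as printed above.
A = np.array([
    [13.0,    91.0,     9.7943],
    [91.0,    819.0,    83.2208],
    [9.7943,  83.2208,  8.7052],
])
rhs = np.array([1398.0, 11326.0, 1187.8])
a, b, c = np.linalg.solve(A, rhs)

# Observed words typed in four minutes for X = 1, 2, ..., 13 (Table 87).
observed = [39, 60, 73, 90, 99, 110, 113, 120, 130, 133, 138, 145, 148]

def fitted(x, a=34.67, b=2.663, c=71.98):
    """Ordinate of equation (184); defaults are the text's rounded constants."""
    return a + b * x + c * math.log10(x)

# Sum of squared differences between fitted and observed ordinates.
sum_sq = sum((fitted(x) - y) ** 2 for x, y in enumerate(observed, start=1))

print(f"a = {a:.2f}, b = {b:.3f}, c = {c:.2f}")
print(f"sum of squared differences = {sum_sq:.2f}")
```

The sum of squares is computed with the rounded constants of equation (184), so that it corresponds to the entries of Table 88.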


* These values were read from an ordinary five-place logarithm table.

Fig. 74. Thurstone's data fitted by a logarithmic growth curve
using the method of least squares

Table 88. Values for Plotting Y = 34.67 + 2.663 X + 71.98 Log X

Pages      X     2.663 X    71.98 Log X    Ordinate, Y    (Y' - Y)²
 250      13     34.619       80.181          149.5          2.25
 230      12     31.956       77.679          144.3           .49
 210      11     29.293       74.959          138.9           .81
 190      10     26.630       71.980          133.3           .09
 170       9     23.967       68.686          127.3          7.29
 150       8     21.304       65.004          121.0          1.00
 130       7     18.641       60.830          114.1          1.21
 110       6     15.978       56.011          106.7         10.89
  90       5     13.315       50.312           98.3           .49
  70       4     10.652       43.336           88.7          1.69
  50       3      7.989       34.343           77.0         16.00
  30       2      5.326       21.668           61.7          2.89
  10       1      2.663       00.000           37.3          2.89

                                          Σ(Y' - Y)² =      47.99

It should finally be noted that quite different forms of learning
curves result when the time is recorded instead of the
amount learned per unit of practice or time.* The data when
recorded in time units may be converted into amount per unit
of time, as shown in Chapter VI, section 8, and then treated as
illustrated above.

* See Thurstone's Monograph cited above.

7.  Fitting  a  Growth  Curve  with  a  Cubic  by  the 
Method  of  Least  Squares 

The  following  data  were  obtained  from  a  correlation  table  of 
age  and  ossification  ratio,  the  latter  being  the  quotient  of  the 
ossified  wrist-bone  area  divided  by  the  area  of  a  quadrilateral 
inclosing  the  carpal  bones.  The  subjects  were  520  boys  in  the 
Laboratory  Schools  of  The  University  of  Chicago.  The  meas- 
urements were  made  within  a  few  days  of  each  birthday. 

Table 89. Data from Laboratory Schools

Central Age     Frequency     Mean Ossification Ratio
    19               3                1.120
    18              13                1.139
    17              39                1.091
    16              54                1.055
    15              84                1.018
    14              63                0.971
    13              48                0.920
    12              44                0.827
    11              38                0.757
    10              36                0.674
     9              30                0.570
     8              24                0.499
     7              21                0.441
     6              15                0.360
     5               8                0.261
  Total            520

We  shall  fit  a  cubic  to  these  data,  first  by  considering  the 
ordinates  of  equal  weights,  and  then  by  weighted  ordinates, 
using  the  observed  frequencies  as  weights. 

From equations (175), it is apparent that the quantities
$\Sigma(Y) \cdots \Sigma(X^3Y)$ and $\Sigma(X) \cdots \Sigma(X^6)$ will be required. The
arithmetic is most easily done on a machine by the continuous
process, that is, multiplying out the sub-products and adding
them cumulatively on the calculator without separate listing.
The complete work is shown, however, in the accompanying
table. It will be noted that X has been measured from the
central age, 12, which makes the sums of the odd powers of X
equal to zero.

Table 90. Showing the Formation of Sums Necessary for Fitting
a Cubic by the Method of Unweighted Ordinates

    X
Age - 12      Y         XY        X²Y        X³Y      X²      X³       X⁴         X⁵          X⁶
    7       1.120      7.840     54.880    384.160    49     343     2,401     16,807     117,649
    6       1.139      6.834     41.004    246.024    36     216     1,296      7,776      46,656
    5       1.091      5.455     27.275    136.375    25     125       625      3,125      15,625
    4       1.055      4.220     16.880     67.520    16      64       256      1,024       4,096
    3       1.018      3.054      9.162     27.486     9      27        81        243         729
    2       0.971      1.942      3.884      7.768     4       8        16         32          64
    1       0.920       .920       .920       .920     1       1         1          1           1
    0       0.827
   -1       0.757     -  .757      .757    -   .757    1     - 1         1        - 1           1
   -2       0.674     - 1.348     2.696    -  5.392    4     - 8        16       - 32          64
   -3       0.570     - 1.710     5.130    - 15.390    9    - 27        81      - 243         729
   -4       0.499     - 1.996     7.984    - 31.936   16    - 64       256    - 1,024       4,096
   -5       0.441     - 2.205    11.025    - 55.125   25   - 125       625    - 3,125      15,625
   -6       0.360     - 2.160    12.960    - 77.760   36   - 216     1,296    - 7,776      46,656
   -7       0.261     - 1.827    12.789    - 89.523   49   - 343     2,401   - 16,807     117,649

    0      11.703     18.262    207.346    594.370   280       0     9,352          0     369,640

Equations (175) may now be written

(a) 11.703 = 15 C0 + 280 C2.

(b) 18.262 = 280 C1 + 9352 C3.

(c) 207.346 = 280 C0 + 9352 C2.

(d) 594.370 = 9352 C1 + 369,640 C3.

These may be solved by elimination, as illustrated in the
preceding section, giving

C0 = + .8305, C1 = + .0743, C2 = - .002693,
and C3 = - .000272.
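As a check on the arithmetic, the same unweighted cubic may be obtained directly from the data of Table 89. The following is a sketch only, assuming NumPy; it uses the library routine `numpy.polynomial.polynomial.polyfit`, which solves the same least-squares problem by a different numerical route than the elimination described in the text. The variable names are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Central ages 19, 18, ..., 5 and mean ossification ratios (Table 89).
ages = np.arange(19, 4, -1)
ratio = np.array([1.120, 1.139, 1.091, 1.055, 1.018, 0.971, 0.920, 0.827,
                  0.757, 0.674, 0.570, 0.499, 0.441, 0.360, 0.261])

# X is measured from the central age, 12, as in Table 90.
x = ages - 12

# Unweighted least-squares cubic; coefficients are returned lowest power first.
c0, c1, c2, c3 = P.polyfit(x, ratio, 3)

print(f"C0 = {c0:.4f}, C1 = {c1:.4f}, C2 = {c2:.6f}, C3 = {c3:.6f}")
```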

The  required  cubic  is  therefore 

Y = .8305 + .0743 X - .002693 X² - .000272 X³.        (185)

In order to compare results with those obtained by the following
method, the origin will be shifted to age 13. Taken from this
point, the equation becomes

Y = .8305 + .0743(X1 + 1) - .002693(X1 + 1)² - .000272(X1 + 1)³
or Y = .902 + .0681 X1 - .00351 X1² - .000272 X1³.        (186)
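The change of origin amounts to expanding each power of (X1 + 1) by the binomial theorem and collecting like terms. A short sketch in Python follows; the function name `shift_origin` is hypothetical and is introduced here only for illustration.

```python
from math import comb

def shift_origin(coeffs, h):
    """Return the coefficients of p(x1 + h), given those of p(x), lowest power first."""
    shifted = [0.0] * len(coeffs)
    for k, c in enumerate(coeffs):
        # Expand c * (x1 + h)**k by the binomial theorem and collect terms.
        for j in range(k + 1):
            shifted[j] += c * comb(k, j) * h ** (k - j)
    return shifted

# Equation (185), with X measured from age 12.
eq185 = [0.8305, 0.0743, -0.002693, -0.000272]

# Equation (186): X = X1 + 1, so that X1 is measured from age 13.
eq186 = shift_origin(eq185, 1)
print([round(c, 6) for c in eq186])
```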

Before  plotting  this  result  together  with  the  observed  points, 
the  equation  of  the  cubic  by  the  method  of  weighted  ordinates 
will  next  be  obtained.  The  arithmetic  is  much  lengthier  be- 
cause none  of  the  terms  vanish  as  above.  Table  91  shows  the 
full  calculation  for  the  sums  entering  into  equations  (176),  each 
of  the  totals  being  divided  by  520  to  give  more  convenient 
numbers. 

Forming equations (176), we find

(a) .8534 = C0 - .2519 C1 + 10.663 C2 - 25.106 C3.
(b) .5047 = - .2519 C0 + 10.663 C1 - 25.106 C2 + 288.51 C3.
(c) 7.2352 = 10.663 C0 - 25.106 C1 + 288.51 C2 - 1295.2 C3.
(d) - 1.6374 = - 25.106 C0 + 288.51 C1 - 1295.2 C2 + 11,377 C3.

The elimination is next given in detail for illustration.
Multiplying (a) by .2519 and adding to (b) gives

(e) .7197 = 10.600 C1 - 22.42 C2 + 282.19 C3.

Multiplying (a) by 10.663 and subtracting (c), we obtain

(f) 1.8646 = 22.42 C1 - 174.81 C2 + 1027.5 C3,

and multiplying (a) by 25.106 and adding to (d), we find

(g) 19.788 = 282.19 C1 - 1027.5 C2 + 10,747 C3.

This gives three equations in three unknowns.
Terms in C1 are next eliminated as follows:

(h) .628 = - 430.6 C2 + 3235 C3      [(g) - 26.622 × (e)]
(i) .1619 = - 60.23 C2 + 203.61 C3   [.4728 × (f) - (e)]


[Table 91: full calculation of the sums entering into equations (176), each total being divided by 520. The tabular values are not legible in this copy.]

Multiplying (h) by .13987 and subtracting from (i) gives,
finally,

(j) .0741 = - 248.9 C3. Therefore C3 = - .000298.

Substituting this value in (h),             C2 = - .00370.
From (g) and (a), we also find              C1 = .0680 and C0 = .9025.

The required cubic is therefore

Y = .9025 + .0680 X - .00370 X² - .000298 X³.        (187)
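The four weighted normal equations may also be solved in one step. The sketch below assumes NumPy and simply enters equations (a) to (d) as printed; the names `M` and `rhs` are illustrative. Because the printed coefficients are rounded, the solution agrees with the constants of equation (187) only to about the figures there given.

```python
import numpy as np

# Coefficient matrix of the weighted normal equations (a)-(d); it is symmetric.
M = np.array([
    [  1.0,     -0.2519,   10.663,    -25.106],
    [ -0.2519,  10.663,   -25.106,    288.51 ],
    [ 10.663,  -25.106,   288.51,   -1295.2  ],
    [-25.106,  288.51,  -1295.2,    11377.0  ],
])
rhs = np.array([0.8534, 0.5047, 7.2352, -1.6374])

# Direct solution, equivalent in principle to the step-by-step elimination.
c0, c1, c2, c3 = np.linalg.solve(M, rhs)
print(f"C0 = {c0:.4f}, C1 = {c1:.4f}, C2 = {c2:.5f}, C3 = {c3:.6f}")
```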

It  will  be  noted  that  the  coefficients  in  equations  (186)  and 
(187)  are  in  close  agreement  except  for  the  last  two,  where  no 
great  effect  will  be  produced  except  for  high  values  of  X.  Com- 
parison of  the  two  cubics  is  shown  in  Table  92,  where  values  of 
Y  have  been  tabulated  with  X  taken  from  the  origin  X  =  13. 
The  plot  of  these  results  in  Fig.  75  shows  that  the  only  notice- 
able difference  in  fit  occurs  for  the  high  values  of  ossification 
ratio,  but  the  number  of  cases  in  this  range  is  so  small  that 
no  very  accurate  smoothing  is  to  be  expected.  Experience 
generally  shows  that  the  method  of  unweighted  ordinates  gives 
approximately  as  good  results  as  the  method  of  weighted  ordi- 
nates, except  when  the  weighting  is  very  uneven. 


Table 92. Values for Plotting Cubics (186) and (187)

                        Ordinates Y
 Age       X      For (186)    For (187)
  20       7        1.113        1.095
  19       6        1.125        1.113
  18       5        1.121        1.113
  17       4        1.101        1.096
  16       3        1.067        1.065
  15       2        1.022        1.021
  14       1        0.966        0.966
  13       0        0.902        0.902
  12      -1        0.831        0.831
  11      -2        0.754        0.754
  10      -3        0.673        0.673
   9      -4        0.591        0.590
   8      -5        0.508        0.507
   7      -6        0.426        0.426
   6      -7        0.347        0.347
   5      -8        0.272        0.274
   4      -9        0.203        0.208

Fig. 75. Plot of the cubics (186) and (187)


8.  The  Method  of  Moments  Applied  to  Frequency  Data 

In  fitting  frequency  distributions  with  mathematical  curves, 
one  of  the  best  and  most  widely  used  procedures  is  the  method 
of  moments,  developed  for  this  purpose  by  Professor  Pearson. 
The  graduation,  or  fit,  is  obtained  by  equating  the  moments  of 
the  data  to  the  moments  of  the  curve  to  be  fitted. 

If a frequency distribution be given with frequencies $f_1, f_2,
f_3, \cdots f_t$, occurring at class values $X_1, X_2, X_3, \cdots X_t$, then the
sum $f_1X_1 + f_2X_2 + f_3X_3 + \cdots + f_tX_t$ is called the first moment
with reference to the origin from which X is measured. Similarly,
$f_1X_1^2 + f_2X_2^2 + f_3X_3^2 + \cdots + f_tX_t^2$ is called the second moment,
and $f_1X_1^3 + f_2X_2^3 + f_3X_3^3 + \cdots + f_tX_t^3$ is known as the third
moment, etc. These quantities may be more briefly written
as $\Sigma fX$, $\Sigma fX^2$, $\Sigma fX^3 \cdots$, so that the

pth moment about the origin $= \Sigma fX^p$.        (188)

When each of the above moments has been divided by N,
the result,

$$\bar{\nu}_p = \frac{\Sigma fX^p}{N}, \qquad \{\text{Moment coefficient about the origin}\} \qquad (189)$$

has been termed by Professor Pearson a moment coefficient
about the origin.

The moment coefficients about the mean are given by the
formula

$$\nu_p = \frac{\Sigma fx^p}{N} = \frac{\Sigma f(X - M)^p}{N}. \qquad \{\text{Formula for } \nu_p \text{ about the mean}\} \qquad (190)$$

The reader will note that the moment coefficient about the
origin is denoted by $\bar{\nu}_p$, while the moment coefficient about
the mean is given by $\nu_p$. Substituting various values for p in
(190), and observing that $\Sigma f = N$, we may write

$$\nu_0 = \frac{\Sigma f}{N} = 1, \qquad (191a)$$

$$\nu_1 = \frac{\Sigma fx}{N} = 0, \qquad (191b)$$

$$\nu_2 = \frac{\Sigma fx^2}{N} = \sigma_x^2, \text{ etc.} \qquad (191c)$$

{Moment  coefficients  about  the  mean} 

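Certain relationships between the moments about the origin
and the mean may be obtained by expanding (190). Thus, since
$M = \bar{\nu}_1$,

$$\nu_p = \frac{1}{N}\Sigma f(X - M)^p = \frac{1}{N}\Sigma f\left[X^p - pX^{p-1}M + \frac{p(p-1)}{1\cdot 2}X^{p-2}M^2 - \frac{p(p-1)(p-2)}{1\cdot 2\cdot 3}X^{p-3}M^3 + \cdots\right],$$

or,

$$\nu_p = \bar{\nu}_p - p\,\bar{\nu}_{p-1}\bar{\nu}_1 + \frac{p(p-1)}{1\cdot 2}\,\bar{\nu}_{p-2}\bar{\nu}_1^2 - \frac{p(p-1)(p-2)}{1\cdot 2\cdot 3}\,\bar{\nu}_{p-3}\bar{\nu}_1^3 + \cdots \qquad (192)$$

Since $\bar{\nu}_0 = \nu_0 = 1$, we find, upon setting p = 1, 2, 3, and 4 in
this last equation, that

$$\nu_1 = \bar{\nu}_1 - \bar{\nu}_1 = 0, \qquad (193a)$$

$$\nu_2 = \bar{\nu}_2 - \bar{\nu}_1^2, \qquad (193b)$$

$$\nu_3 = \bar{\nu}_3 - 3\bar{\nu}_1\bar{\nu}_2 + 2\bar{\nu}_1^3, \qquad (193c)$$

and that

$$\nu_4 = \bar{\nu}_4 - 4\bar{\nu}_1\bar{\nu}_3 + 6\bar{\nu}_1^2\bar{\nu}_2 - 3\bar{\nu}_1^4. \qquad (193d)$$

{Moment coefficients about the mean in terms of those about the origin}

By taking the moments about the origin, we may also write

$$\bar{\nu}_p = \frac{\Sigma fX^p}{N} = \frac{\Sigma f(x + \bar{\nu}_1)^p}{N},$$

or

$$\bar{\nu}_p = \nu_p + p\,\nu_{p-1}\bar{\nu}_1 + \frac{p(p-1)}{1\cdot 2}\,\nu_{p-2}\bar{\nu}_1^2 + \frac{p(p-1)(p-2)}{1\cdot 2\cdot 3}\,\nu_{p-3}\bar{\nu}_1^3 + \cdots$$

Transposing we then have

$$\nu_p = \bar{\nu}_p - p\,\nu_{p-1}\bar{\nu}_1 - \frac{p(p-1)}{1\cdot 2}\,\nu_{p-2}\bar{\nu}_1^2 - \frac{p(p-1)(p-2)}{1\cdot 2\cdot 3}\,\nu_{p-3}\bar{\nu}_1^3 - \cdots \qquad (194)$$

Substituting values of p from 0 to 4 gives the following set of
equations, which may be used as a check on equations (193):

$$\nu_0 = 1, \qquad (195a)$$

$$\nu_1 = 0, \qquad (195b)$$

$$\nu_2 = \bar{\nu}_2 - \bar{\nu}_1^2, \qquad (195c)$$

$$\nu_3 = \bar{\nu}_3 - 3\bar{\nu}_1\nu_2 - \bar{\nu}_1^3, \qquad (195d)$$

$$\nu_4 = \bar{\nu}_4 - 4\bar{\nu}_1\nu_3 - 6\bar{\nu}_1^2\nu_2 - \bar{\nu}_1^4. \qquad (195e)$$

{Moment coefficients about the mean in terms of moments about the origin and mean}

The fifth, sixth, and higher moments might be formed in a
similar way, but Professor Pearson * has shown that, except for
very large samples, their probable errors are too high for the
results to be of any value in curve-fitting.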
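Equations (193) and (195) may be verified numerically on any small distribution. The sketch below is an illustration only: the function names and the two-class distribution are hypothetical, chosen so that the arithmetic is easily followed by hand.

```python
def raw_moment_coefficients(freqs, values, max_p=4):
    """Moment coefficients about the origin, nu-bar_0 ... nu-bar_max_p (eq. 189)."""
    n = sum(freqs)
    return [sum(f * x ** p for f, x in zip(freqs, values)) / n
            for p in range(max_p + 1)]

def central_from_raw(raw):
    """Equations (193): moment coefficients about the mean from those about the origin."""
    _, m1, m2, m3, m4 = raw
    return [1.0,
            0.0,
            m2 - m1 ** 2,
            m3 - 3 * m1 * m2 + 2 * m1 ** 3,
            m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4]

def central_by_check(raw):
    """Equations (195): the same quantities built up step by step as a check."""
    _, m1, m2, m3, m4 = raw
    nu2 = m2 - m1 ** 2
    nu3 = m3 - 3 * m1 * nu2 - m1 ** 3
    nu4 = m4 - 4 * m1 * nu3 - 6 * m1 ** 2 * nu2 - m1 ** 4
    return [1.0, 0.0, nu2, nu3, nu4]

# Hypothetical skewed distribution: three cases at X = 0 and one case at X = 4.
freqs, values = [3, 1], [0, 4]
raw = raw_moment_coefficients(freqs, values)
central = central_from_raw(raw)
check = central_by_check(raw)
print(raw, central, check)
```

For this distribution the mean is 1, and the deviations are -1 (three times) and +3 (once), so the two routes can be compared with a direct calculation from the deviations.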

It  should  be  noted  that  equations  (193)  and  (195)  hold  when 
X  is  measured  from  any  origin,  since  x  =  X  —  M  =  X'  —  M', 
where  X'  =  X  —  A,  A  being  the  arbitrary  origin.  The  moment 
coefficients  about  the  mean  may  therefore  be  obtained  by  choos- 
ing an  arbitrary  point  and  making  subsequent  adjustment  as 
in  the  case  of  the  standard  deviation. 

* Karl Pearson, "Skew Correlation and Non-Linear Regression," Draper's Research
Memoirs II. Cambridge University Press, 1905.

It is now necessary to distinguish two types of series which
may arise:

a.  The  data  may  consist  of  a  system  of  isolated  ordinates  as 
in  the  case  of  the  point  binomial.  This  type,  however,  will 
not  be  considered  in  the  present  treatment. 

b.  The  data  may  consist  of  a  system  of  areas  as  in  the  fre- 
quency distribution  of  a  measured  variable.  Here  the  moments 
are  calculated  by  assuming  that  the  areas  are  concentrated  at 
the  class  values  and  corrections  for  equations  (191)  to  (195)  are 
therefore  necessary.  These  adjustments,  which  are  known  as 
Sheppard's*  corrections,  will  next  be  given  and  the  complete 
arithmetic  shown  for  a  distribution  resembling  the  normal 
curve. 

Denoting the moment coefficients adjusted for grouping by
$\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$, Sheppard's corrections may be written

$$\mu_1 = \nu_1, \qquad (196a)$$

$$\mu_2 = \nu_2 - \tfrac{1}{12} = \nu_2 - .083333,\dagger \qquad (196b)$$

$$\mu_3 = \nu_3, \qquad (196c)$$

$$\mu_4 = \nu_4 - \tfrac{1}{2}\nu_2 + \tfrac{7}{240} = \nu_4 - .5\,\nu_2 + .02916667. \qquad (196d)$$

{Moment coefficients about the mean adjusted for grouping (Sheppard's corrections)}

The proof of these equations is based on the assumption that
the derivatives of the frequency function vanish at the limits of
the curve. The corrections are to be used therefore when the
distribution has "high contact" at the extremes of the scale,
that is, tapers off gradually at both ends.

* W. F. Sheppard, Proceedings of the London Mathematical Society, Vol. XXIX,
pp. 353-380.

† Note that $\sigma = \sqrt{\mu_2}\,h = \sqrt{\nu_2 - \tfrac{1}{12}}\;h$.

Professor Karl Pearson has developed a number of curves for
the purpose of describing biometric data. These curves, which
vary from extremely skewed to symmetrical types, are identified
by certain criteria worked out from the distributions to which
the curves are to be fitted. Some of the constants used by
Professor Pearson may be set down as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \kappa_1 = 2\beta_2 - 3\beta_1 - 6,$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2}, \qquad \kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}. \qquad (197)$$

{Pearson's constants for curve-fitting}

It will be noted that $\beta_1$ and $\beta_2$ are independent of the units of
measure of the distributed variables.
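The constants of (197) are readily computed once the adjusted moment coefficients are known. The following sketch is offered for illustration; the function name `pearson_constants` and the trial values of the moments are hypothetical. It should be observed that the second criterion is undefined when the first vanishes, since the first then appears as a zero factor in the denominator.

```python
def pearson_constants(mu2, mu3, mu4):
    """Return (beta1, beta2, kappa1, kappa2) as defined in equations (197)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    kappa1 = 2 * beta2 - 3 * beta1 - 6
    denominator = 4 * (4 * beta2 - 3 * beta1) * kappa1
    # kappa2 cannot be formed when the denominator vanishes (e.g. kappa1 = 0).
    kappa2 = beta1 * (beta2 + 3) ** 2 / denominator if denominator != 0 else float("nan")
    return beta1, beta2, kappa1, kappa2

# Hypothetical trial values, chosen only so that the arithmetic is simple.
beta1, beta2, kappa1, kappa2 = pearson_constants(mu2=2.0, mu3=1.0, mu4=15.0)
print(beta1, beta2, kappa1, kappa2)
```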
The  steps  in  curve-fitting  are  then  briefly  as  follows : 

1. Work out the first four adjusted moment coefficients,
$\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$.

2. Form $\beta_1$, $\beta_2$, $\kappa_1$, and $\kappa_2$, in order to determine which type
of curve to employ.

3. Find the constants of the curve selected from the moments
and the $\beta$'s (formulas for the maximum ordinate and
other parameters are given in Elderton* for each type of curve).

4. Plot the curve with a histogram of the data and note
the general goodness of fit.

5. Test the goodness of fit by the $\chi^2$ method, finding the areas
under the curve by arithmetical or mechanical integration.

In  the  following  section  these  steps  will  be  illustrated  by  the 
normal  probability  curve. 


9.  Fitting  a  Normal  Curve  by  the  Method  of  Moments 

The  data  selected  for  graduation  consist  of  the  heights  of  men 
in  the  British  Isles  (see  Table  41,  p.  206).  These  have  been 
chosen  because  they  furnish  a  fairly  good  example  of  normally 
distributed  data  and  illustrate  the  simplest  of  Pearson's  types 
of  frequency  curves. 

* See Elderton's "Frequency Curves and Correlation," Jones's "First Course in
Statistics," and Pearson's Tables, Introduction, for detailed discussion of these types
of curves.

The criteria for the normal curve $y = y_0 e^{-x^2/c}$, which should
be satisfied if this curve is appropriate, are

$$\beta_1 = 0 \qquad (198)$$

and

$$\beta_2 = 3, \qquad (199)$$

while the constants are determined by

$$c = 2\mu_2, \qquad (200)$$

$$y_0 = \frac{N}{\sqrt{2\pi\mu_2}}, \qquad (201)$$

and

$$M = 0 = \text{origin}. \qquad (202)$$

{Criteria and constants for a normal curve}

It is now necessary to work out these values from the data,
and compare with those given by equations (198) and (199).
The constants for the curve are furnished by equations (200),
(201), and (202).
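Equations (200) and (201) translate directly into a short routine. The sketch below is illustrative only; the function name `normal_curve` is hypothetical, and the abscissa x is taken in class-interval units measured from the mean. The trial values N = 8585 and adjusted second moment 6.533471 are those found for the height data in this section.

```python
import math

def normal_curve(n, mu2):
    """Return c, y0 and an ordinate function for y = y0 * exp(-x**2 / c)."""
    c = 2.0 * mu2                              # equation (200)
    y0 = n / math.sqrt(2.0 * math.pi * mu2)    # equation (201)

    def ordinate(x):
        # x is the deviation from the mean in class-interval units.
        return y0 * math.exp(-x * x / c)

    return c, y0, ordinate

c, y0, ordinate = normal_curve(8585, 6.533471)
print(f"c = {c:.6f}, y0 = {y0:.1f}, ordinate at x = 2: {ordinate(2.0):.1f}")
```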

In calculating the unadjusted moments the arithmetic may
be conveniently arranged as illustrated by Table 93.
Using equation (189), we find from the values at the bottom
of the table that $\bar{\nu}_1 = .020850$, $\bar{\nu}_2 = 6.617239$, $\bar{\nu}_3 = 0.206057$,
and $\bar{\nu}_4 = 137.689109$. Substituting these values in equations
(193) or (195), the unadjusted moment coefficients about the
mean become

$\nu_1 = 0$, $\nu_2 = 6.616804$, $\nu_3 = -.207833$, and $\nu_4 = 137.689183$.

The adjusted moment coefficients may now be found from
equations (196), giving

$\mu_1 = 0$, $\mu_2 = 6.533471$, $\mu_3 = -.207833$, and $\mu_4 = 134.4099$.

By substituting these last values in equations (197), where
the general expressions for $\beta_1$ and $\beta_2$ are given, we find

$\beta_1 = .000155$
and $\beta_2 = 3.14879$.

The values for $\kappa_1$ and $\kappa_2$ are not required in fitting the normal
probability curve.
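The whole chain of calculation, from the frequencies of Table 93 to the two criteria, may be sketched as follows. The variable names are illustrative. The machine carries the first moment coefficient to more figures than the six decimals used in the text, so the third moment coefficient may differ from the printed value in the last place or two.

```python
# Frequencies of Table 93, from d = +10 down to d = -10.
freqs = [2, 5, 16, 32, 79, 202, 392, 646, 1063, 1230, 1329,
         1223, 990, 669, 394, 169, 83, 41, 14, 4, 2]
devs = list(range(10, -11, -1))
n = sum(freqs)

# Unadjusted moment coefficients about the arbitrary origin, equation (189).
raw = [sum(f * d ** p for f, d in zip(freqs, devs)) / n for p in range(5)]
m1, m2, m3, m4 = raw[1:]

# Moment coefficients about the mean, equations (193).
nu2 = m2 - m1 ** 2
nu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
nu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

# Sheppard's corrections, equations (196).
mu2 = nu2 - 1 / 12
mu3 = nu3
mu4 = nu4 - nu2 / 2 + 7 / 240

# Pearson's criteria, equations (197).
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

print(f"N = {n}, mu2 = {mu2:.6f}, mu3 = {mu3:.6f}, mu4 = {mu4:.4f}")
print(f"beta1 = {beta1:.6f}, beta2 = {beta2:.5f}")
```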


Table 93. Showing Calculation of the First Four Unadjusted
Moments of a Frequency Distribution

Central
Height          f        d         fd         fd²         fd³          fd⁴
77 7/16         2       10         20         200       2,000       20,000
76 7/16         5        9         45         405       3,645       32,805
75 7/16        16        8        128       1,024       8,192       65,536
74 7/16        32        7        224       1,568      10,976       76,832
73 7/16        79        6        474       2,844      17,064      102,384
72 7/16       202        5      1,010       5,050      25,250      126,250
71 7/16       392        4      1,568       6,272      25,088      100,352
70 7/16       646        3      1,938       5,814      17,442       52,326
69 7/16     1,063        2      2,126       4,252       8,504       17,008
68 7/16     1,230        1      1,230       1,230       1,230        1,230
67 7/16     1,329        0          -           -           -            -
66 7/16     1,223       -1    - 1,223       1,223     - 1,223        1,223
65 7/16       990       -2    - 1,980       3,960     - 7,920       15,840
64 7/16       669       -3    - 2,007       6,021    - 18,063       54,189
63 7/16       394       -4    - 1,576       6,304    - 25,216      100,864
62 7/16       169       -5      - 845       4,225    - 21,125      105,625
61 7/16        83       -6      - 498       2,988    - 17,928      107,568
60 7/16        41       -7      - 287       2,009    - 14,063       98,441
59 7/16        14       -8      - 112         896     - 7,168       57,344
58 7/16         4       -9       - 36         324     - 2,916       26,244
57 7/16         2      -10       - 20         200     - 2,000       20,000

Totals      8,585               + 179      56,809     + 1,769    1,182,061
Unadjusted moments             = Nν̄₁      = Nν̄₂      = Nν̄₃       = Nν̄₄

The probable errors of $\sqrt{\beta_1}$ and $\beta_2$ for samples from a normal
population are given approximately by

$$\text{P.E. of } \sqrt{\beta_1} = \frac{.6745\sqrt{6}}{\sqrt{N}} \qquad (203)$$

and

$$\text{P.E. of } \beta_2 = \frac{.6745\sqrt{24}}{\sqrt{N}}. \qquad (204)$$

We may therefore write $\sqrt{\beta_1} = .012 \pm .018$ and $\beta_2 = 3.149 \pm .036$,
and conclude that the normal curve is appropriate even though
a value of $\beta_2$ as high as 3.149 is rather improbable.
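The two probable errors may be reproduced in a few lines. This is a sketch only; the function name `probable_errors` is hypothetical.

```python
import math

def probable_errors(n):
    """Approximate probable errors of sqrt(beta1) and beta2, equations (203) and (204)."""
    pe_sqrt_beta1 = 0.6745 * math.sqrt(6.0 / n)
    pe_beta2 = 0.6745 * math.sqrt(24.0 / n)
    return pe_sqrt_beta1, pe_beta2

# Height data: N = 8585 and beta1 = .000155, as found above.
pe1, pe2 = probable_errors(8585)
sqrt_beta1 = math.sqrt(0.000155)
print(f"sqrt(beta1) = {sqrt_beta1:.3f} +/- {pe1:.3f}")
print(f"P.E. of beta2 = {pe2:.3f}")
```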

When the goodness of fit is tested by $\chi^2$ as in section 7 of
Chapter XIII, it is found that the fit is satisfactory. This is
left as an exercise for the student.

EXERCISES

1. Fit a hyperbola by the method of averages to the data in the
accompanying table. Use the scale 1, 2, 3, ... for pages written, and
select $X_k = 1$, $Y_k = 30$ for rectifying point.


Pages Written

Words Typed in Four Minutes

370 

192 

350 

1&8 

330 

184 

310 

172 

290 

195 

270 

178 

250 

180 

230 

1&4 

210 

161 

190 

160 

170 

151 

150 

142 

130 

137 

110 

1?W 

90 

106 

70 

100 

50 

81 

30 

57 

10 

30

(Y = (X - 1)/(.027 + .0044 X) + 30.   Ans.)


2. Fit the data of Exercise 1 with a logarithmic growth curve,
using the method of least squares. Compare the fit with that
obtained for the hyperbola.

(Y = 22.56 + .526 X + 127.1 log X.   Ans.)

3. The data on page 346 are the ossification ratios of 540 girls of
the Laboratory Schools of The University of Chicago. Fit a cubic
to the means by the method of least squares. Use unweighted
ordinates and take the origin at age 12.

4. Calculate and plot the means of the columns from the table
on page 189. Fit a cubic to these points by the method of least
squares, using unweighted ordinates. Compare the equation with
the following, based on more data: *

Y = 23.14 + 1.2545 X - .0089 X² + .000025 X³.

* See Memoirs of the National Academy of Sciences, Vol. XV, p. 576.

Central Age       f       Mean Ossification Ratio
    19            5               1.160
    18           14               1.102
    17           53               1.098
    16           63               1.108
    15           69               1.089
    14           63               1.061
    13           40               1.033
    12           44                .988
    11           38                .898
    10           38                .834
     9           39                .730
     8           26                .662
     7           17                .523
     6           23                .442
     5            8                .358
  Total         540

(Y = .961 + .0576 X - .00475 X² - .0000230 X³.   Ans.)

5. Data: cephalic index of 1982 boys aged 13 (from Professor
Pearson's laboratory).

Index       f          Index       f
  91        1            78      293.5
  90        1            77      236.5
  89        4            76      181.5
  88        4            75      156.5
  87        7            74       78
  86       23            73       49
  85       31            72       23
  84       58            71       26
  83       93            70        8
  82      130            69        8
  81      156            68        2
  80      181.5          67        3
  79      227.5
                       Total    1982

Find $\mu_2$, $\mu_3$, $\mu_4$, $\beta_1$, and $\beta_2$, and fit with a normal curve. Work out the
chi-square test for goodness of fit.

($\mu_2 = 10.980$; $\mu_3 = 2.326$; $\mu_4 = 409.112$; $\beta_1 = .0041$; $\beta_2 = 3.393$;
$y_0 = 238.62$; $P = .0001$, throwing together the five highest and also the
four lowest groups.   Ans.)

LIST OF IMPORTANT FORMULAS FOR REFERENCE


«-¥■ 

-=¥=-(§> 


Md  =  /./.+ 


Md  =  u,  I 


2 


/. 


up 


L 


md 


fmd 


K 


h. 


N 


G.M.  =  V^i   X2   Xz  • 


log(G.M.)=^21ogW. 


Xn' 


H~ 
M.D.  = 


1  v/1 
N^\X 


2U1 


N 


M.D. 


_  (2  I  fd\)h±{Am  -  M)(Na  -  Nt) 


N 


J        Mean  for 
\ungrouped  series 

}           (5) 

(    Mean  for   "1 
1^  distribution  / 

(6) 

Median  for 

-  distribution  ► 

counting  up 

(8  a) 

•< 

Median  for  dis 

tribution  count 

ing  down 

(8b) 

r  Geometric  \ 
\      mean      j 

(9) 

'  Logarithmic  ^ 
<  form  of  geo- 
^  metric  mean  ^ 

(10) 

{Harmonic  meai 

1}         (11) 

{Mean  deviatioi 

n}       ,    (12) 

•      M 

Mean  deviatio 

for  frequency 

distribution 

n' 
■         (14) 

S.D.  = 


S.D. 


N 


J  Standard  deviation,^      ^^  ev 
\       original  form        J 


'^2 


N 


-  {My. 


'  Standard  devia-  ] 

^  tion  for  reduced  > 

series  J 


(16) 


* For notation see list of important symbols in Appendix B.
S.D.=


V  /W2 


fd' 


Qs-  Qi 


Zfd 

N 


h. 


Qi  = 

v  = 

Xi  = 


^    f 

u.l. X  /l, 

h 

I  I.  +  : X  h. 


h 


100(7 

M 


0-1 


Sk 


Ml  +  -  (X2  -  Ma). 

(Qs  -  Md)  -  (Md  -  Qi) 

Q 

Qi  +  Qz-2Md 


M  -  Mo 


3  (M  -  Md) 


Pt  =  ?./.+ 


'Standard  de-^ 

viation  for 

distribution 


{Quartile  deviation} 


{ 


Quartiles      1 
for  distribution  J 


P,  =u.l- 


Ru-  Ri 


R,  =  R^-^il^L^(X-U.). 


Rr   = 


_  lOO[h(X-Ll)  +  (fuf,)h] 


Nh 


{Coefficient  of  variation} 

'Transmutation  formula^ 

for  comparable  scores, 

score  form 


'  Measure  of 

skewness 

based  on 

quartiles 


f  Pearson's  measure  )^ 
\       of  skewness       J 

( Approximate  meas- 1 
\     ure  of  skewness     j 


r  Percentiles, 
\  counting  up 


("Counting)^ 
\    down    j 

r  Percentile  ^ 
■I  rank  formula,  [ 
[        form  1        J 

r  Percentile  rank\ 
\  formula,  form  2  J 


(17) 
(19) 

(20  a) 

(20b) 
(22) 
(23) 

(24) 

(25) 
(26) 

(27  a) 

(27  b) 
(28) 
(29)
50/.  ,  lOO(fup)     50/.  ,  „
c^x  —  — r; — I IT —  — 77~  T"  ^l' 


N 


r  = 


N 
Xxy 


N 


_        Zxy 


r  = 


V2x2  2^2 


XXY  -  NMxMy 


/Class  valuel 
1^       rank      j 

r      Product-moment 
-J  correlation  coefficient, 
[         original  form 

r Correlation  coefficient] 
-{  in  terms  of  deviations  \- 
L  from  means  J 


/Correlation  coefficient/ 

V(2A'2  —  NMx^)(^Y^  —  NMy^)    1  (based  on  raw  scores)  J 


r  = 


2  zr  -  TxMy 


.  /  Correlation  coefficient  / 
y/CZX^  -  TxMxX^Y^  -  TyMy)     I     equivalent  to  (33)     / 


^fxydxdy 


r  = 


V 

2/.dx2  - 

TV 

^fydy^  ~ 

C^fydyY 

N 

^hc 


{Correlation  coefficient  for  distribution  table] 


^=  ^^^ 


Sy    =    (Jy   VT 


^  =  u—  y- 


O"*  ^x 

X=r—Y-r^My^Mx. 


Regression  line  for 
means  of  columns,  re 
ferred  to  mean  of  table  J 

/  Standard  error  / 
/    of  estimate    / 


r  Regression  line  for  "] 
-j  means  of  rows,  referred  V 
y      to  mean  of  table       J 


r   Regression   1 
<  lines  in  score  y 


349 
(30) 

(31) 

(32) 

(33) 
(34) 


=  -7=-       (35) 


I 


form 


J 


(36) 
(37) 
(38) 

(39) 
(40) 


< 


Regression  lines  in^ 
score  form  and  sym- 
bols on  correlation 
sheet 


y 


(41) 
(42) 
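The raw-score form of the correlation coefficient, formula (33), and the regression line of Y on X in score form, formula (39), can be verified on a small series. The following is a minimal sketch; the function names and the three-case series are hypothetical and serve only to illustrate the arithmetic.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment r from raw scores, as in formula (33)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    sxx = sum(x * x for x in xs) - n * mx * mx
    syy = sum(y * y for y in ys) - n * my * my
    return sxy / sqrt(sxx * syy)


def regression_y_on_x(xs, ys):
    """Slope and intercept of Y = r(sd_y/sd_x)X - r(sd_y/sd_x)Mx + My."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum(x * x for x in xs) / n - mx * mx)
    sd_y = sqrt(sum(y * y for y in ys) / n - my * my)
    slope = pearson_r(xs, ys) * sd_y / sd_x
    return slope, my - slope * mx


# Hypothetical paired scores.
X = [1, 2, 3]
Y = [1, 3, 2]
print(pearson_r(X, Y), regression_y_on_x(X, Y))
```

Working from the sums of raw scores avoids computing each deviation from the mean separately, which is the practical advantage of form (33).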
$b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x} = \dfrac{ak}{bh}$,   $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y} = \dfrac{ah}{ck}$.  {Regression coefficients}  (43)

$\mathrm{P.E.}\ (\text{est. } y) = .6745\,\sigma_y\sqrt{1 - r^{2}}$.  {Probable error of estimate of y}  (45)

$\mathrm{P.E.}\ (\text{est. } x) = .6745\,\sigma_x\sqrt{1 - r^{2}}$.  {Probable error of estimate of x}  (46)

$I_p = 100\left(1 - \dfrac{\sigma_y\sqrt{1 - r^{2}}}{\sigma_y}\right) = 100\,(1 - \sqrt{1 - r^{2}})$.  {Improvement over chance in prediction by a single score}

$r_{nn} = \dfrac{n\,r_{1I}}{1 + (n - 1)\,r_{1I}}$.  {Spearman-Brown formula for predicting reliability of lengthened tests}

$r_{cn} = \dfrac{n\,r_{cz}}{\sqrt{n + n(n - 1)\,r_{zz}}}$.  {Formula for predicting validity of lengthened tests}  (51)

$S = R - \dfrac{W}{n - 1} = R - CW$.  {Multiple-response scoring formula}  (52)

$R_{12} = \dfrac{r_{12}\,\dfrac{\Sigma_1}{\sigma_1}}{\sqrt{1 - r_{12}^{2} + r_{12}^{2}\,\dfrac{\Sigma_1^{2}}{\sigma_1^{2}}}}$.  {Correlation after selection}  (53)

{Correlation ratios, original form}  (55), (56)

{Correlation ratios as quotients of two standard deviations}  (60), (61)

$\eta_{yx}^{2} = \dfrac{\Sigma f_x(M_y - \bar{Y}_x)^{2}}{N\sigma_y^{2}}$.  {Correlation ratio for means of columns}  (62)
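Two of the test-construction formulas on this page reduce to one-line computations: the Spearman-Brown formula for the reliability of a test lengthened n times, and the multiple-response scoring formula (52), S = R − W/(n − 1). The sketch below is illustrative; the function names and the sample values are assumptions introduced for the example.

```python
def spearman_brown(r11, n):
    """Predicted reliability of a test n times its original length."""
    return n * r11 / (1 + (n - 1) * r11)


def corrected_score(right, wrong, choices):
    """Score corrected for guessing: S = R - W / (n - 1)."""
    return right - wrong / (choices - 1)


# Hypothetical values: reliability .60 for the original test, doubled in length;
# 40 right and 12 wrong responses on a four-choice test.
print(spearman_brown(0.60, 2))
print(corrected_score(40, 12, 4))
```

In the scoring formula the constant C of the list of symbols corresponds to 1/(n − 1), so that a two-choice test gives the familiar rights-minus-wrongs score.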
$\eta_{yx} = \sqrt{\dfrac{e}{c}}$,   $\eta_{xy} = \sqrt{\dfrac{d}{b}}$.  {Correlation ratios for correlation blank}  (64)

$\bar{X}_y = A_x + \left(\dfrac{\Sigma' f_{xy}d_x}{f_y}\right)h$,  (65)

$\bar{Y}_x = A_y + \left(\dfrac{\Sigma' f_{xy}d_y}{f_x}\right)k$.  {Means of the arrays in a correlation table}  (66)

$\eta^{2} - r^{2} < \dfrac{4.047}{\sqrt{N}}\sqrt{(\eta^{2} - r^{2})\{(1 - \eta^{2})^{2} - (1 - r^{2})^{2} + 1\}}$.  (67)

{Blakeman's test for linearity}

$\sqrt{N}\,\sqrt{\eta^{2} - r^{2}} < 4.047$.  (68)

{Blakeman's short test for linearity}

$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t)$.  {Corrective formula for eliminating age}  (69)

$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t)\,\dfrac{\sigma_s}{\sigma_t}$.  {Corrective formula adjusting for age and heteroscedasticity}  (70)

${}_nP_r = n(n - 1)(n - 2)\cdots(n - r + 1)$.  (71)

{Permutation of n things r at a time}

${}_nC_r = \dfrac{n(n - 1)(n - 2)\cdots(n - r + 1)}{1\cdot 2\cdot 3\cdots r} = \dfrac{{}_nP_r}{r!}$.

{Combination of n things r at a time}
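The products defining the permutation and combination of n things r at a time may be checked directly. The sketch below is illustrative only; the function names are assumptions made for the example.

```python
import math


def permutations_count(n, r):
    """n(n - 1)(n - 2) ... (n - r + 1), the permutations of n things r at a time."""
    result = 1
    for k in range(n - r + 1, n + 1):
        result *= k
    return result


def combinations_count(n, r):
    """Permutations of n things r at a time divided by 1 * 2 * 3 ... r."""
    return permutations_count(n, r) // math.factorial(r)


print(permutations_count(5, 3), combinations_count(5, 3))
```

Each combination of r things can be arranged in r! orders, which is why the permutation count is divided by r! to obtain the combination count.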

{Point  binomial} 

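$M = np$.  {Mean of the point binomial}  (78)

$\sigma = \sqrt{npq}$.  {Standard deviation of the point binomial}  (79)

$y = \dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^{2}}{2\sigma^{2}}}$.  {Normal curve with area = 1}  (80)

$y = \dfrac{N}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^{2}}{2\sigma^{2}}} = y_0\,e^{-\frac{x^{2}}{2\sigma^{2}}}$.  {Normal curve with area = N}  (82)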
$z = \dfrac{1}{\sqrt{2\pi}}\,e^{-\frac{x^{2}}{2}}$.  {Ordinate of the normal curve, with unit area and standard deviation}  (83)

$\mathrm{P.E.} = .6744898\,\sigma$.  {Relation between P.E. and σ}  (84)

${}_1\bar{x}_2 = \dfrac{z_1 - z_2}{{}_1n_2}$.  {Mean of a portion of a normal curve, with unit area and standard deviation}  (85)

{Mean of a portion of a normal curve, with area = N}  (86)

$\mathrm{P.E.}_{M} = .6745\,\dfrac{\sigma_x}{\sqrt{N}}$.  {Probable error of the mean}  (88)

$\mathrm{P.E.}_{M_1 - M_2} = \sqrt{(\mathrm{P.E.}_{M_1})^{2} + (\mathrm{P.E.}_{M_2})^{2}}$.  {Probable error of the difference between two uncorrelated means}  (89)

$\mathrm{P.E.}_{Md} = \dfrac{.84535\,\sigma}{\sqrt{N}} = 1.2533\,\mathrm{P.E.}_{M}$.  {Probable error of the median}  (90)

$\mathrm{P.E.}_{\sigma} = \dfrac{.6745\,\sigma}{\sqrt{2N}} = \dfrac{.4769\,\sigma}{\sqrt{N}} = .7071\,\mathrm{P.E.}_{M}$.  {Probable error of the standard deviation}

$\mathrm{P.E.}_{V} = \dfrac{.6745\,V}{\sqrt{2N}}\left[1 + 2\left(\dfrac{V}{100}\right)^{2}\right]^{\frac{1}{2}}$.  {Probable error of V with Pearson's Tables}  (93)

$\mathrm{P.E.}_{r} = \dfrac{.6745\,(1 - r^{2})}{\sqrt{N}}$.  {Probable error of the correlation coefficient}  (94)

$\mathrm{P.E.}_{\eta} = \dfrac{.6745\,(1 - \eta^{2})}{\sqrt{N}}$.  {Probable error of the correlation ratio}  (95)

$\mathrm{P.E.}_{b_{xy}} = .6745\,\dfrac{\sigma_x\sqrt{1 - r^{2}}}{\sigma_y\sqrt{N}}$,  (96 a)

$\mathrm{P.E.}_{b_{yx}} = .6745\,\dfrac{\sigma_y\sqrt{1 - r^{2}}}{\sigma_x\sqrt{N}}$.  {Probable errors of regression coefficients}  (96 b)
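The probable errors of the mean (88), of the difference between two uncorrelated means (89), and of the correlation coefficient (94) all rest on the constant .6745. A minimal sketch follows; the function names and sample values are hypothetical, and the optional correlation argument anticipates the formula for correlated means, (100).

```python
from math import sqrt

PE_CONSTANT = 0.6745  # ratio of the probable error to the standard error


def pe_mean(sd, n):
    """Probable error of the mean: .6745 * sd / sqrt(N)."""
    return PE_CONSTANT * sd / sqrt(n)


def pe_r(r, n):
    """Probable error of the correlation coefficient: .6745 (1 - r^2) / sqrt(N)."""
    return PE_CONSTANT * (1 - r * r) / sqrt(n)


def pe_difference(pe_1, pe_2, r12=0.0):
    """Probable error of a difference; r12 = 0 gives the uncorrelated case."""
    return sqrt(pe_1 ** 2 + pe_2 ** 2 - 2 * r12 * pe_1 * pe_2)


# Hypothetical sample: standard deviation 10, N = 100, r = .60.
print(pe_mean(10, 100), pe_r(0.60, 100), pe_difference(3.0, 4.0))
```

A positive correlation between the two measures reduces the probable error of their difference, which is why the uncorrelated formula overstates the error for paired data.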
$\mathrm{P.E.}_{b_{12.34\ldots n}} = .6745\,\dfrac{\sigma_{1.234\ldots n}}{\sigma_{2.34\ldots n}\sqrt{N}}$.  {Probable error of higher-order regression coefficient}  (97)

$\mathrm{P.E.}_{\zeta} = \dfrac{2\,(.6745)}{\sqrt{N}}\sqrt{(\eta^{2} - r^{2})\{(1 - \eta^{2})^{2} - (1 - r^{2})^{2} + 1\}}$.  (98)

{Probable error of η² − r²}

$\mathrm{P.E.}_{A - B} = \sqrt{(\mathrm{P.E.}_{A})^{2} + (\mathrm{P.E.}_{B})^{2} - 2\,r_{AB}\,(\mathrm{P.E.}_{A})(\mathrm{P.E.}_{B})}$.  (99)

{Probable error of difference with correlated measures}

$\mathrm{P.E.}_{M_1 - M_2} = \sqrt{(\mathrm{P.E.}_{M_1})^{2} + (\mathrm{P.E.}_{M_2})^{2} - 2\,r_{12}\,\mathrm{P.E.}_{M_1}\,\mathrm{P.E.}_{M_2}}$.  (100)

{Probable error of difference between means where correlated}

$\mathrm{P.E.}_{f} = .6745\sqrt{f\left(1 - \dfrac{f}{N}\right)}$.  {Probable error of an observed frequency}  (101)

$\mathrm{P.E.}_{f_p} = .6745\sqrt{\dfrac{f_p\,(100 - f_p)}{N}}$.  {Probable error of a percentage frequency}  (102)

$\chi^{2} = \Sigma\,\dfrac{(f'_t - f_t)^{2}}{f_t}$.  {Chi-square function}  (103)

$\mathrm{P.E.}_{np} = .6745\sqrt{npq}$,

$\mathrm{P.E.}_{p} = .6745\sqrt{\dfrac{pq}{n}}$.  {Probable errors of the mean and of the proportion of successes}  (106)

$\mathrm{P.E.}_{e_1}$ (of individual $X_1$) $= .6745\,\sigma_{x_1}\sqrt{1 - r_{1I}}$.  (111)

{Probable error of response for X1}

$\mathrm{P.E.}$ (of individual $z_1 - z_2$) $= .6745\sqrt{2 - r_{1I} - r_{2II}}$.  (113)

$r_{st} = \dfrac{r_{12}}{\sqrt{r_{1I}\,r_{2II}}}$.  {Spearman's correction for attenuation}

$\dfrac{\sigma}{\Sigma} = \dfrac{\sqrt{1 - R_{1I}}}{\sqrt{1 - r_{1I}}}$,  (116 a)

$r_{1I} = \dfrac{\sigma^{2} - \Sigma^{2}(1 - R_{1I})}{\sigma^{2}}$.  {Kelley's formula for adjusting reliability coefficients}  (116 b)
$r = \dfrac{\Sigma f_xd_x(\bar{Y}_x - M_y)\,h}{N\sigma_x\sigma_y}$,  (118 a)

$r = \dfrac{\Sigma f_yd_y(\bar{X}_y - M_x)\,k}{N\sigma_x\sigma_y}$.  {Pearson's formulas for the correlation coefficient}  (118 b)

{Correlation coefficient adapted for use with data on a normal scale}  (119)

{Pearson's corrective formula for broad grouping, assuming normal distributions of the variates}  (120)

{Correlation ratio adapted for use with data on a normal scale}  (121)

${}_c\eta_{yx} = \dfrac{\eta_{yx}}{r_{xc}}$.  {Correlation ratio corrected for broad categories}  (122)

{Correlation of a variable with its class value}  (123)

$r_{bis.} = \dfrac{\bar{Y}_2 - \bar{Y}_1}{\sigma_y}\cdot\dfrac{pq}{z}$.  {Biserial r}  (124)

$\mathrm{P.E.}_{(bis.\ r)} = \dfrac{.6745\left(\dfrac{\sqrt{pq}}{z} - r^{2}\right)}{\sqrt{N}}$.  {Probable error of biserial r}  (125)

$C = \sqrt{\dfrac{S' - N}{S'}}$.  {First computation form for contingency}  (128 a)

$C = \sqrt{\dfrac{S - 1}{S}}$.  {Second computation form for contingency}  (128 b)
${}_cC = \dfrac{C}{r_{xc}\,r_{yc}}$.  {Correction to the contingency coefficient for broad grouping}  (129)

{Probable error of contingency coefficient}  (130)

$\rho = 1 - \dfrac{6\,\Sigma(v_x - v_y)^{2}}{N(N^{2} - 1)}$.  {Spearman's formula based on rank differences}  (131)

$\mathrm{P.E.}_{r\ (\text{from } \rho)} = \dfrac{.7063\,(1 - r^{2})}{\sqrt{N}}$.  {Probable error of r from formula (132)}  (133)

$r_{12.3} = \dfrac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^{2})(1 - r_{23}^{2})}}$.  {Partial-correlation coefficient for three variables}  (134)

$r_{12.34\ldots n} = \dfrac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{\sqrt{[1 - r_{1n.34\ldots(n-1)}^{2}][1 - r_{2n.34\ldots(n-1)}^{2}]}}$.  (135)

{Partial-correlation coefficient of the order (n − 2)}

$X_1 = b_{12.34\ldots n}X_2 + b_{13.24\ldots n}X_3 + \cdots + b_{1n.23\ldots(n-1)}X_n + C$.  (139)

{Regression equation for estimating X1 from the remaining (n − 1) variables}

$b_{12.34\ldots n} = r_{12.34\ldots n}\,\dfrac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}}$.  {Regression coefficient of the order (n − 2)}  (140)

$\sigma_{1.23\ldots n} = \sigma_1\sqrt{(1 - r_{12}^{2})(1 - r_{13.2}^{2})\cdots(1 - r_{1n.23\ldots(n-1)}^{2})}$.  (141)

{Standard deviation of the order (n − 1)}

$\mathrm{P.E.}_{est} = .6745\,\sigma_{1.23\ldots n}$.  (142)

{Probable error of estimate}

$C = M_1 - b_{12.34\ldots n}M_2 - b_{13.24\ldots n}M_3 - \cdots - b_{1n.23\ldots(n-1)}M_n$.  (146)

{Constant term in regression equation}

$\sigma_{1.23} = \sigma_1\sqrt{\dfrac{1 - r_{12}^{2} - r_{13}^{2} - r_{23}^{2} + 2\,r_{12}r_{13}r_{23}}{1 - r_{23}^{2}}} = \sigma_1\sqrt{\dfrac{S_{123}}{1 - r_{23}^{2}}}$.  (147)

{Standard error of second-order in terms of zero-order coefficients}
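The partial-correlation formulas (134) and (135) are recursive: a coefficient of any order is obtained from three coefficients of the next lower order. The sketch below illustrates this for a second-order coefficient; the function names and the sample correlations are assumptions made for the example.

```python
from math import sqrt


def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3, as in formula (134)."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def second_order_partial(r12, r13, r14, r23, r24, r34):
    """r12.34 built from first-order coefficients, following formula (135)."""
    r12_3 = partial_r(r12, r13, r23)
    r14_3 = partial_r(r14, r13, r34)
    r24_3 = partial_r(r24, r23, r34)
    # Apply the same reduction once more, now eliminating variable 4.
    return partial_r(r12_3, r14_3, r24_3)


# Hypothetical case in which every zero-order correlation is .50.
print(partial_r(0.5, 0.5, 0.5))
print(second_order_partial(0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
```

Because the same three-term reduction is applied at every order, a single routine for formula (134) suffices for coefficients of any order.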
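$r_{12.34} = \dfrac{r_{12}(1 - r_{34}^{2}) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}{\sqrt{(1 - r_{13}^{2} - r_{14}^{2} - r_{34}^{2} + 2\,r_{13}r_{14}r_{34})(1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34})}} = \dfrac{S_{12.34}}{\sqrt{S_{134}\,S_{234}}}$.  (151 a)

$r_{13.24} = \dfrac{r_{13}(1 - r_{24}^{2}) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}{\sqrt{(1 - r_{12}^{2} - r_{14}^{2} - r_{24}^{2} + 2\,r_{12}r_{14}r_{24})(1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34})}} = \dfrac{S_{13.24}}{\sqrt{S_{124}\,S_{234}}}$.  (151 b)

$r_{14.23} = \dfrac{r_{14}(1 - r_{23}^{2}) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}{\sqrt{(1 - r_{12}^{2} - r_{13}^{2} - r_{23}^{2} + 2\,r_{12}r_{13}r_{23})(1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34})}} = \dfrac{S_{14.23}}{\sqrt{S_{123}\,S_{234}}}$.  (151 c)

{Second-order correlation coefficients in terms of zero-order coefficients}

$b_{12.34\ldots n} = \dfrac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{1 - r_{2n.34\ldots(n-1)}^{2}}\cdot\dfrac{\sigma_{1.34\ldots(n-1)}}{\sigma_{2.34\ldots(n-1)}}$.  (152)

{Reduction formula for regression coefficient}

$b_{12.34} = \dfrac{\sigma_1}{\sigma_2}\left[\dfrac{r_{12}(1 - r_{34}^{2}) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}{1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34}}\right] = \dfrac{\sigma_1}{\sigma_2}\cdot\dfrac{S_{12.34}}{S_{234}}$.  (153 a)

$b_{13.24} = \dfrac{\sigma_1}{\sigma_3}\left[\dfrac{r_{13}(1 - r_{24}^{2}) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}{1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34}}\right] = \dfrac{\sigma_1}{\sigma_3}\cdot\dfrac{S_{13.24}}{S_{234}}$.  (153 b)

$b_{14.23} = \dfrac{\sigma_1}{\sigma_4}\left[\dfrac{r_{14}(1 - r_{23}^{2}) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}{1 - r_{23}^{2} - r_{24}^{2} - r_{34}^{2} + 2\,r_{23}r_{24}r_{34}}\right] = \dfrac{\sigma_1}{\sigma_4}\cdot\dfrac{S_{14.23}}{S_{234}}$.  (153 c)

{Second-order regression coefficients in terms of zero-order coefficients}

$\sigma_{1.234} = \sigma_1\sqrt{\dfrac{S_{123}\,S_{234} - S_{14.23}^{2}}{(1 - r_{23}^{2})\,S_{234}}}$.  {Standard deviation of third-order in terms of zero-order coefficients}  (154)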
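$R_{1(23\ldots n)} = \sqrt{1 - \dfrac{\sigma_{1.23\ldots n}^{2}}{\sigma_1^{2}}}$.  {Multiple-correlation coefficient}  (155)

$1 - R_{1(23\ldots n)}^{2} = (1 - r_{12}^{2})(1 - r_{13.2}^{2})(1 - r_{14.23}^{2})\cdots(1 - r_{1n.23\ldots(n-1)}^{2})$.  (156)

{Computation form for R}

$R_{1(23\ldots n)} = r_{1x}\sqrt{\dfrac{n - 1}{1 + (n - 2)\,r_{xx}}}$.  {Multiple-correlation coefficient for equal coefficients}  (157)

$\Delta = \begin{vmatrix} r_{11} & r_{21} & r_{31} & \cdots & r_{n1} \\ r_{12} & r_{22} & r_{32} & \cdots & r_{n2} \\ r_{13} & r_{23} & r_{33} & \cdots & r_{n3} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1n} & r_{2n} & r_{3n} & \cdots & r_{nn} \end{vmatrix}$.  {Determinant of zero-order coefficients}  (160)

$\Delta = \begin{vmatrix} r_{11} & r_{21} & r_{31} \\ r_{12} & r_{22} & r_{32} \\ r_{13} & r_{23} & r_{33} \end{vmatrix}$.  {Determinant for three variables}  (161)

$\Delta = 1 - r_{12}^{2} - r_{13}^{2} - r_{23}^{2} + 2\,r_{12}r_{13}r_{23}$.  {Expansion of (161)}  (162)

$r_{12.34\ldots n} = \dfrac{-\Delta_{12}}{\sqrt{\Delta_{11}\Delta_{22}}}$.  (163)

$r_{1k.34\ldots n} = \dfrac{-\Delta_{1k}}{\sqrt{\Delta_{11}\Delta_{kk}}}$.  (164)

$R_{1(23\ldots n)} = \sqrt{1 - \dfrac{\Delta}{\Delta_{11}}}$.  (165)

$\sigma_{1.23\ldots n} = \sigma_1\sqrt{1 - R^{2}} = \sigma_1\sqrt{\dfrac{\Delta}{\Delta_{11}}}$.  (166)

$b_{12.34\ldots n} = -\dfrac{\Delta_{12}}{\Delta_{11}}\cdot\dfrac{\sigma_1}{\sigma_2}$.  (167)

$b_{1k.23\ldots n} = -\dfrac{\Delta_{1k}}{\Delta_{11}}\cdot\dfrac{\sigma_1}{\sigma_k}$.  (168)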
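For three variables the determinant form of the multiple-correlation coefficient is easily computed, since the determinant expands to 1 − r12² − r13² − r23² + 2 r12 r13 r23 and the cofactor of r11 is 1 − r23². The sketch below is illustrative; the function names and sample correlations are assumptions, and the second function is included only to compare the result with the computation form (156).

```python
from math import sqrt


def multiple_r_three(r12, r13, r23):
    """R for variable 1 on variables 2 and 3, as sqrt(1 - determinant / cofactor)."""
    determinant = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    cofactor_11 = 1 - r23 ** 2
    return sqrt(1 - determinant / cofactor_11)


def multiple_r_by_product(r12, r13, r23):
    """The same R from 1 - R^2 = (1 - r12^2)(1 - r13.2^2)."""
    r13_2 = (r13 - r12 * r23) / sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
    return sqrt(1 - (1 - r12 ** 2) * (1 - r13_2 ** 2))


# Hypothetical zero-order correlations.
print(multiple_r_three(0.5, 0.5, 0.5), multiple_r_by_product(0.5, 0.5, 0.5))
```

Agreement of the two results offers a convenient check on the arithmetic when the coefficients are computed by hand.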
$Y = \dfrac{X}{a + bX} + C$.  {Hyperbola}  (171)

$Y = a + bX + c\log X$.  {Logarithmic growth function}  (172)

$Y = C_0 + C_1X + C_2X^{2} + C_3X^{3} + \cdots + C_nX^{n}$.  {nth-order parabola}  (173)

$\Sigma Y = C_0\Sigma(1) + C_1\Sigma(X) + C_2\Sigma(X^{2}) + \cdots + C_n\Sigma(X^{n})$,  (175 a)

$\Sigma XY = C_0\Sigma(X) + C_1\Sigma(X^{2}) + C_2\Sigma(X^{3}) + \cdots + C_n\Sigma(X^{n+1})$,  (175 b)

$\Sigma X^{2}Y = C_0\Sigma(X^{2}) + C_1\Sigma(X^{3}) + C_2\Sigma(X^{4}) + \cdots + C_n\Sigma(X^{n+2})$,  (175 c)

$\Sigma X^{n}Y = C_0\Sigma(X^{n}) + C_1\Sigma(X^{n+1}) + C_2\Sigma(X^{n+2}) + \cdots + C_n\Sigma(X^{2n})$.  (175 d)

{Normal equations, unweighted ordinates}

$\Sigma f_xY = C_0\Sigma(f_x) + C_1\Sigma(f_xX) + C_2\Sigma(f_xX^{2}) + \cdots + C_n\Sigma(f_xX^{n})$,  (176 a)

$\Sigma f_xXY = C_0\Sigma(f_xX) + C_1\Sigma(f_xX^{2}) + C_2\Sigma(f_xX^{3}) + \cdots + C_n\Sigma(f_xX^{n+1})$,  (176 b)

$\Sigma f_xX^{n}Y = C_0\Sigma(f_xX^{n}) + C_1\Sigma(f_xX^{n+1}) + C_2\Sigma(f_xX^{n+2}) + \cdots + C_n\Sigma(f_xX^{2n})$.  (176 c)

{Normal equations, weighted ordinates}

$\Sigma(Y) = a\Sigma(1) + b\Sigma(X) + c\Sigma(\log X)$,  (183 a)

$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^{2}) + c\Sigma(X\log X)$,  (183 b)

$\Sigma(Y\log X) = a\Sigma(\log X) + b\Sigma(X\log X) + c\Sigma(\log X)^{2}$.  (183 c)

{Normal equations for the logarithmic growth curve}
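The normal equations for unweighted ordinates, (175 a) to (175 d), form a set of n + 1 simultaneous linear equations in the constants C0 ... Cn. The sketch below builds these equations from the power sums and solves them by elimination. It is an illustration under stated assumptions only: the function name is hypothetical, no check is made for a singular system, and the sample ordinates are invented to lie exactly on a second-order parabola.

```python
def fit_parabola(xs, ys, order):
    """Least-squares constants C0 ... Cn of an nth-order parabola."""
    m = order + 1
    # Power sums S(X^p) for p = 0 ... 2n, and S(X^p Y) for p = 0 ... n.
    sx = [sum(x ** p for x in xs) for p in range(2 * order + 1)]
    sxy = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(m)]
    # Row i of the augmented matrix is the normal equation for S(X^i Y).
    a = [[sx[i + j] for j in range(m)] + [sxy[i]] for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for c in range(col, m + 1):
                a[row][c] -= factor * a[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for i in range(m - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (a[i][m] - tail) / a[i][i]
    return coeffs


# Hypothetical ordinates lying on Y = 1 + 2X + 3X^2.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]
print(fit_parabola(xs, ys, 2))
```

For the weighted equations (176) each term of every sum would simply be multiplied by the frequency of its ordinate before the system is solved.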

pth moment about the origin $= \Sigma fX^{p}$.  (188)

$\nu_p = \dfrac{\Sigma fX^{p}}{N}$.  {Moment coefficient about the origin}  (189)

$\bar{\nu}_p = \dfrac{\Sigma fx^{p}}{N} = \dfrac{\Sigma f(X - M)^{p}}{N}$.  {Formula for moment coefficients about the mean}  (190)

$\bar{\nu}_0 = \dfrac{\Sigma f}{N} = 1$,  (191 a)

$\bar{\nu}_1 = \dfrac{\Sigma fx}{N} = 0$,  (191 b)

$\bar{\nu}_2 = \dfrac{\Sigma fx^{2}}{N} = \sigma_x^{2}$, etc.  (191 c)

{Moment coefficients about the mean}

$\bar{\nu}_1 = \nu_1 - \nu_1 = 0$,  (193 a)

$\bar{\nu}_2 = \nu_2 - \nu_1^{2}$,  (193 b)

$\bar{\nu}_3 = \nu_3 - 3\,\nu_1\nu_2 + 2\,\nu_1^{3}$,  (193 c)

$\bar{\nu}_4 = \nu_4 - 4\,\nu_1\nu_3 + 6\,\nu_1^{2}\nu_2 - 3\,\nu_1^{4}$.  (193 d)

{Moment coefficients about the mean in terms of those about the origin}

$\bar{\nu}_0 = 1$,  (195 a)

$\bar{\nu}_1 = 0$,  (195 b)

$\bar{\nu}_2 = \nu_2 - \nu_1^{2}$,  (195 c)

$\bar{\nu}_3 = \nu_3 - 3\,\nu_1\bar{\nu}_2 - \nu_1^{3}$,  (195 d)

$\bar{\nu}_4 = \nu_4 - 4\,\nu_1\bar{\nu}_3 - 6\,\nu_1^{2}\bar{\nu}_2 - \nu_1^{4}$.  (195 e)

{Moment coefficients about the mean in terms of moments about the origin and mean}

$\mu_1 = \bar{\nu}_1$,  (196 a)

$\mu_2 = \bar{\nu}_2 - \tfrac{1}{12} = \bar{\nu}_2 - .083333$,  (196 b)

$\mu_3 = \bar{\nu}_3$,  (196 c)

$\mu_4 = \bar{\nu}_4 - \tfrac{1}{2}\,\bar{\nu}_2 + \tfrac{7}{240} = \bar{\nu}_4 - .5\,\bar{\nu}_2 + .02916667$.  (196 d)

{Moment coefficients about the mean adjusted for grouping (Sheppard's corrections)}

$\kappa_1 = 2\beta_2 - 3\beta_1 - 6$,

$\kappa_2 = \dfrac{\beta_1(\beta_2 + 3)^{2}}{4\,(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}$.  (197)

{Pearson's constants for curve-fitting}
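The passage from moment coefficients about the origin to those about the mean, formulas (193 b) to (193 d), followed by Sheppard's corrections, formulas (196 b) to (196 d), can be traced on a small distribution. The sketch below is illustrative only: the function names and the sample distribution are assumptions, and the corrections are applied on the assumption that the scores are expressed in units of the class interval.

```python
def moments_about_origin(values, freqs, max_order=4):
    """Moment coefficients about the origin, orders 0 ... max_order."""
    n = sum(freqs)
    return [sum(f * x ** p for x, f in zip(values, freqs)) / n
            for p in range(max_order + 1)]


def moments_about_mean(v):
    """Second, third, and fourth moment coefficients about the mean."""
    v1, v2, v3, v4 = v[1], v[2], v[3], v[4]
    m2 = v2 - v1 ** 2
    m3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
    m4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4
    return m2, m3, m4


def sheppard(m2, m3, m4):
    """Adjust for grouping; assumes moments in class-interval units."""
    return m2 - 1 / 12, m3, m4 - m2 / 2 + 7 / 240


# Hypothetical distribution: scores 1, 2, 3, 4, each with frequency 1.
origin = moments_about_origin([1, 2, 3, 4], [1, 1, 1, 1])
print(moments_about_mean(origin))
print(sheppard(*moments_about_mean(origin)))
```

The third moment is left unchanged by the corrections, in agreement with formula (196 c).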


APPENDIX  B 

LIST  OF  IMPORTANT  SYMBOLS 

In  the  following  list  the  symbols  are  given  in  the  order  in 
which  they  first  appear  in  the  formulas  of  Appendix  A. 

1.  M  denotes  the  arithmetic  mean. 

2.  Σ denotes the sum of the items of the sort indicated.

3.  X  denotes  a  raw  score  taken  as  a  deviation  from  zero. 

4.  N  denotes  the  size  of  the  sample  or  the  number  of  cases  used. 

5.  A  denotes  an  assumed  mean  or  arbitrary  origin. 

6.  f denotes the frequency in a class interval.

7.  d  denotes  a  score  as  a  deviation  from  an  assumed  mean  and 
is  expressed  in  units  of  class  intervals. 

8.  h  denotes  the  width  of  the  class  interval. 

9.  Md  denotes  the  median. 

10.  l.l. denotes the lower limit of the interval containing the
median in formula (8 a).

11.  u.l. denotes the upper limit of the interval containing the
median in formula (8 b).

12.  fup denotes the total frequency up to the interval containing
the median.

13.  fdo  denotes  the  total  frequency  down  to  the  interval  containing 
the  median. 

14.  fmd   denotes   the   frequency   in   the   interval    containing   the 
median. 

15.  G.M.  denotes  the  geometric  mean. 

16.  X1X2X3 · · · XN denotes the product of the N values of X.

17.  H  denotes  the  harmonic  mean. 

18.  M.D.  denotes  the  mean  deviation  of  scores  from  the  arithmetic 
mean. 

19.  |x|  denotes  the  absolute  value  of  x,  where  x  =  X  —  M. 

20.  Am  denotes  the  mid-point  of  the  interval  in  which  M  lies. 

21.  Na  denotes  the  number  of  cases  above  M. 

22.  Nb  denotes  the  number  of  cases  below  M. 

23.  S.D. denotes the standard deviation.
24.  X′ denotes a deviation of the score from an assumed mean,
that is, X′ = X − A.

25.  M'  denotes  the  arithmetic  mean  of  the  X'  scores. 

26.  Q    denotes  the  quartile  deviation. 

27.  Q1 denotes the first quartile, which is the value below which
one quarter of the cases lie.

28.  Q3 denotes the third quartile, which is the value below which
three quarters of the cases lie.

29.  u.l. denotes the upper limit of the interval containing Q3 in
formula (20 a).

30.  l.l. denotes the lower limit of the interval containing Q1 in
formula (20 b).

31.  fdo denotes the total frequency down to the interval containing
Q3 in formula (20 a).

32.  fup denotes the total frequency up to the interval containing
Q1 in formula (20 b).

33.  f3 denotes the frequency in the interval containing Q3.

34.  f1 denotes the frequency in the interval containing Q1.

35.  V    denotes  the  coefficient  of  variation. 

36.  σ denotes the standard deviation.

37.  Sk  denotes  a  measure  of  skewness. 

38.  Mo  denotes  the  mode. 

39.  Pp  denotes  a  percentile  value. 

40.  p     denotes  the  percentage  of  the  cases  smaller  than  Pp  in 
formulas  (27  a)  and  (27  b). 

41.  fp denotes the frequency in the interval where Pp lies.

42.  Rx  denotes  the  percentile  rank  of  a  score  X  in  formulas  (28) 
and  (29). 

43.  Rl denotes the percentile rank of the lower limit of the interval
containing X.

44.  Ru  denotes  the  percentile  rank  of  the  upper  limit  of  the  interval 
containing  X. 

45.  fx    denotes  the  frequency  in  the  interval  containing  X  in 
formula  (29). 

46.  cRx  denotes  the  percentile  rank  of  the  middle  of  the  interval 
containing  X. 

47.  r  denotes  the  product-moment  coefficient  of  correlation. 

48.  x and y denote deviations from the respective means for X
and Y, that is, x = X − Mx and y = Y − My.

49.  σx and σy denote the standard deviations for X and Y,
respectively.

50.  Tx and Ty denote the total of the X scores and the Y scores,
respectively, that is, Tx = ΣX and Ty = ΣY.

51.  fx denotes the frequency of a column of type x in formula (35).

52.  fy   denotes  the  frequency  of  a  row  of  type  y. 

53.  fxy  denotes  the  frequency  of  a  cell  common  to  a  column  and 
a  row. 

54.  dx  and  dy  denote  the  deviations  in  class  intervals  from  the 
assumed  means  for  the  two  variables. 

55.  a,  b,  and  c   denote  the  three  parts  of  formula  (35)  and  are 
defined  by  that  equation. 

56.  Sy  denotes  the  standard  error  of  estimate  in  predicting  Y  from 
X  by  a  regression  equation. 

57.  y  and  x   denote  points  on  the  regression  lines  (36)  and  (38), 
respectively. 

58.  Y and X denote points on the regression lines (39) and (40),
respectively. It should be noted that y = Y − My and x = X − Mx.

59.  h and k denote the widths of the class intervals for X and Y,
respectively.

60.  byx  denotes  the  regression  coefficient  for  y  on  x. 

61.  bxy  denotes  the  regression  coefficient  for  x  on  y. 

62.  P.E.  denotes  the  probable  error. 

63.  Ip  denotes  the  improvement  over  chance  in  prediction  from  a 
regression  equation  by  a  single  score. 

64.  rnn denotes the predicted reliability coefficient of a test n
times its original length.

65.  r1I denotes the reliability coefficient or correlation between
two parallel forms of a test.

66.  rcn denotes the correlation between a criterion and a test
n times its original length.

67.  rcz denotes the average correlation between a criterion and
each of several tests z1, z2, z3 · · · zn in formula (51).

68.  rzz denotes the average intercorrelation of the tests z1, z2,
z3 · · · zn in formula (51).

69.  S   denotes  the  score  corrected  for  guessing  by  formula  (52). 

70.  R  denotes  the  number  of  right  responses  in  formula  (52). 

71.  W  denotes  the  number  of  wrong  responses  in  formula  (52). 

72.  C   denotes  a  constant  in  formula  (52). 

73.  n    denotes  the  number  of  choices  in  answering  a  multiple- 
response  test  in  formula  (52). 

74.  R12 denotes the correlation after selection in formula (53).

75.  Σ1 denotes the standard deviation after selection in formula (53).
76.  ηyx denotes the correlation ratio based on the means of the
columns.

77.  ηxy denotes the correlation ratio based on the means of the
rows.

78.  σay denotes the standard deviation of y − ȳx, where ȳx denotes
the mean of a column.

79.  σȳ denotes the standard deviation of ȳx.

80.  Yx  denotes  the  mean  of  a  column.    It  should  be  noted  that 

81.  e is defined by Σfx(My − Ȳx)²/k².

82.  d is defined by Σfy(Mx − X̄y)²/h².

83.  2'  denotes  summations  over  an  array,  for  example,  over  a  row 
or  column  in  the  correlation  table. 

84.  Ax  and  Ay  denote  the  assumed  means  for  X  and  for  Y,  respec- 
tively. 

85.  Ys denotes a variable corrected for age by formulas (69) and
(70).

86.  Ȳs denotes the mean at age As in formulas (69) and (70).

87.  σt denotes the standard deviation of the array at age At in
formula (70).

88.  nPr  denotes  the  permutation  of  n  things  r  at  a  time. 

89.  nCr  denotes  the  combination  of  n  things  r  at  a  time. 

90.  q denotes the probability for the failure of an event.

91.  p denotes the probability for the success of an event.

92.  n denotes the number of independent events for formula (77).

93.  π denotes the value obtained by dividing the circumference
of a circle by its diameter.

94.  e denotes the base of the Napierian system of logarithms as
used in formula (80).

95.  y0 denotes the ordinate at x = 0 for a normal curve.

96.  z denotes the ordinate of a normal curve with unit area and
unit standard deviation.

97.  ₁x̄₂ denotes the mean of the portion of a normal curve lying
between the ordinates z1 and z2.

98.  ₁n₂ denotes the fractional part of the area of a normal curve
lying between the ordinates z1 and z2.

99.  fp  denotes  a  percentage  frequency. 

100.  χ² denotes the chi-square function given by formula (103).

101.  f′t and ft denote observed and theoretical frequencies, respectively,
in formula (103).

102.  e1 denotes response error in formula (111).

103.  z1 and z2 denote standard scores defined by x1/σ1 and x2/σ2,
respectively, in formula (113).

104.  rst denotes the correlation between "true" scores s and t
which are freed from the influence of response error.

105.  crxy denotes the correlation coefficient corrected for broad
grouping.

106.  zs and z′s denote ordinates on the two scales of a normal
correlation surface as used in formula (120).

107.  cηyx denotes the correlation ratio corrected for broad grouping.

108.  rxc denotes the correlation of a variable with its class value.

109.  q  and  p  denote  the  parts  of  the  unit  normal  curve  to  the 
left  and  right  of  the  ordinate  z  in  formula  (124). 

110.  rbis.  denotes  biserial  r. 

111.  C  denotes  the  contingency  coefficient  in  formulas  (128a)  and 
(128b). 

112.  cC   denotes  the  contingency  coefficient  corrected  for  broad 
grouping. 

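113.  S′ is defined by $\Sigma\,\dfrac{f_{xy}^{2}}{f_xf_y/N}$ in formula (128 a).

114.  S is defined by $\Sigma\,\dfrac{f_{xy}^{2}}{f_xf_y}$ in formula (128 b).

115.  φ² is defined by $\dfrac{1}{N}\,\Sigma\,\dfrac{\left(f_{xy} - \dfrac{f_xf_y}{N}\right)^{2}}{\dfrac{f_xf_y}{N}} = S - 1$ in formula (130).

116.  ψ³ is defined by $\dfrac{1}{N}\,\Sigma\,\dfrac{\left(f_{xy} - \dfrac{f_xf_y}{N}\right)^{3}}{\left(\dfrac{f_xf_y}{N}\right)^{2}}$ in formula (130).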

117.  ρ denotes Spearman's rank difference correlation coefficient.

118.  vx and vy denote the ranks for the X and the Y series,
respectively, in formula (131).

119.  r12.34...n denotes a partial-correlation coefficient of the order
(n − 2).

120.  r²1n.34...(n−1) denotes the square of the designated partial-
correlation coefficient.

121.  b12.34...n denotes a regression coefficient of the order (n − 2).

122.  C denotes the constant term in a regression equation and is
defined by formula (146).
123.  σ1.23...n denotes the standard deviation of the order (n − 1).

124.  S123, S12.34, etc. denote sums involving zero-order correlation
coefficients and are defined by formulas (147) and (151).

125.  R1(23...n) denotes the multiple-correlation coefficient.

126.  Δ denotes a determinant.

127.  Aij denotes a minor of a determinant obtained by deleting
the coefficients in a row and column common to rij.

128.  Δij denotes a cofactor and is equal to Aij with the sign that
would be attached in expanding the determinant.

129.  νp denotes a moment coefficient about the origin.

130.  ν̄p denotes a moment coefficient about the mean.

131.  μ1, μ2, μ3, and μ4 denote adjusted moment coefficients about
the mean.

132.  β1 and β2 denote functions of the adjusted moment coefficients
and are used in curve-fitting.

133.  κ1 and κ2 denote functions of β1 and β2 and are used in
curve-fitting.

SELECTED BOOKS FOR SUPPLEMENTARY READING

A.  Texts on Educational Statistics:

1.  Statistical  Methods  Applied  to  Educational  Problems,  by  Harold 

O.  Rugg.   Houghton  Mifflin  Company,  1917. 
A  very  readable  book  on  elementary  methods. 

2.  Statistics  in  Education  and  Psychology,  by  Henry  E.  Garrett. 

Longmans,  Green  &  Co.,  1926. 

A  good  discussion  of  reliability  and  partial  correlation. 

3.  Fundamentals  of  Statistics,  by  L.  L.  Thurstone.   The  Macmillan 

Company,  1925. 
A  clear  presentation  of  elementary  methods. 

4.  Statistical  Method  in  Educational  Measurement,  by  Arthur  S. 

Otis.   World  Book  Company,  1925. 
Contains  a  full  treatment  of  percentile  curves. 

5.  Statistical  Method,  by  T.  L.  Kelley.   The  Macmillan  Company, 

1923. 
An  advanced  book  including  many  important  formulas. 

6.  Essentials of Mental Measurement, by W. Brown and G. Thomson.

Cambridge University Press, London, 1921.
Discusses psychophysical methods and the Spearman two-factor theory.

7.  Graphic  Methods  in  Education,  by  J.  H.  Williams.    Houghton 

Mifflin  Company,  1924. 
Shows  how  to  prepare  charts  and  diagrams. 

B.  General  Texts: 

1.  Introduction  to  the  Theory  of  Statistics,  by  G.  Yule.    Charles 

Griffin,  London,  1926. 
The best general text, but somewhat difficult for beginners.

2.  First Course in Statistics, by D. C. Jones.  G. Bell, London.

A clearly written text. Contains a good discussion of frequency curve-fitting.
3.  Handbook of Mathematical Statistics, by H. L. Rietz and others.

Houghton Mifflin Company, 1924.
A useful reference book.

4.  Mathematical  Analysis  of  Statistics,  by  C.  H.  Forsyth.   John 

Wiley  &  Sons,  1924. 
Clear treatment of interpolation.  Suitable for students with mathematical training.

5.  Mathematical Theory of Probabilities, by Arne Fisher.  The Macmillan Company, 1922.

A  careful  development  of  the  theory  of  probability  and  applications 
to  statistical  problems.    For  advanced  students. 

6.  Frequency  Curves  and  Correlation,  by  W.  P.  Elderton.    C.  and 

E.  Layton,  London,  1927. 
A  good  exposition  of  Pearson's  System  of  frequency  curve-fitting. 

7.  Calculus  of  Observations,  by  E.  T.  Whittaker  and  G.  Robinson. 

D.  Van  Nostrand  Company,  1924. 
An  excellent  text  for  the  advanced  mathematical  student. 

8.  Mathematical  Statistics,  by  Henry  Lewis  Rietz.     The  Open 

Court  Publishing  Company,  Chicago,  1927. 
A  concise,  clear,  and  excellent  monograph.    Especially  recommended 
for  students  who  have  had  calculus. 

C.  Texts  in  Other  Fields: 

1.  Medical  Biometry  and  Statistics,  by  Raymond  Pearl.    W.  B. 

Saunders  Company,  1923. 
A  clearly  written  text  for  students  of  medicine  and  public  health. 

2.  Statistical  Methods,  by  Frederick  C.  Mills.    Henry  Holt  and 

Company,  1924. 
One  of  the  best  books  in  the  field  of  economics. 

3.  Elements  of  Statistics,  by  A.  L.  Bowley.   P.  S.  King,  London, 

1920. 

An  advanced  book  on  economic  statistics,  by  the  most  authoritative 
writer. 

D.  Aids  in  Calculation: 

1.  Tables  for   Statisticians   and   Biometricians,    edited   by   Karl 
Pearson.    Cambridge  University  Press,  London,  1924. 
New  edition  forthcoming. 
The best tables for advanced work.
2.  Tables of √(1 − r²) and 1 − r², by J. R. Miner.  The Johns Hopkins Press, 1922.
Every student with access to a calculation machine should have these
tables.

3.  Barlow's Tables of Squares, etc. (1-10,000).  E. and F. Spon,

London.  (May be obtained at The University of Chicago
Bookstore.)
The  classical  handbook. 

4. Tables of Applied Mathematics in Statistics, by J. W. Glover.
George Wahr, Ann Arbor, Michigan, 1924.
A  valuable  aid  for  the  actuary  and  advanced  student. 

5. Statistical Tables for Students in Education and Psychology,
by Karl J. Holzinger. The University of Chicago Press, 1925.
Adapted  for  classroom  use. 

6. Probable Errors of the Correlation Coefficient, by Karl J.
Holzinger. Cambridge University Press, London, 1925.
Four-place  values  with  proportional  parts. 

7. Chambers's Mathematical Tables. W. R. Chambers, London,
1921.
Contains  seven-place  logarithm  tables. 

8. Five-Place Logarithmic and Trigonometric Tables, by James M.
Taylor. Ginn and Company, 1905.
A clearly printed and convenient set of tables.

Age, corrective formula for eliminating, 185 f.

Analysis  of  classified  data,  7 

Area  of  normal  curve,  209  f. 

Arithmetical mean, 48; calculation of, 79 ff.; properties of,
83 f.; reliability of, 85

Arithmetical  progression,  47 

Attenuation,  Spearman's  correction  for, 
253 

Averages, 78 ff.; method of, in curve-fitting, 320 f., 325 ff.
See also Arithmetical mean, Geometrical mean, Harmonic mean,
Median, Mode

Ayres,  Leonard  P.,  10,  26 

Bar  diagrams,  38  f. 
Binomial distribution, 190 ff.
Binomial law, experimental verification of, 199 f.
Biserial  r,  271  ff. 

Blakeman's  test  for  linearity,  183,  267 
Burgess,  William  R.,  11  f.,  299 
Burt,  Cyril,  305  ff. 

Calculation,  of  statistical  constants,  7 ; 
errors  in,  65  ff. 

Card,  data,  20 

Central  tendency,  variations  in,  79 

Characters, in statistical series, 12 f.; ordered and unordered,
13; continuous and discontinuous, 14; classes of, 22; static
and dynamic, 75; methods of correlation for two, 256 ff.

Chi-Square Test, Pearson's, 245 ff.

Class  limits,  23  f. 

Class  values,  24,  80  ;  percentile  rank  of, 
138  f. 

Classification of data, 9 ff.

Classifier, 25 ff.

Coefficient, of variation, 116 ff.; product-moment correlation,
143 ff.; regression, 159, 161; reliability, 168 ff.; validity,
168; probable error of, of variation, 238; probable error of
correlation, 238; of contingency, 273 ff.

Cofactor  in  a  determinant,  312 

Collection,  units  of,  11  f. 

Column  diagram,  36  f. 

Combinations,  191 

Comparable  measurements,  118  ff. 

Compensating  errors,  66 

Constants,  statistical,  7 

Contingency,  coefficient  of,  273  ff. 

Coordinates,  40  ff. 

Correlation, linear, 141 ff.; Spearman's theorem on, 168;
Spearman-Brown prophecy formula, 169 f.; effect of selection
upon, 172; non-linear, 177 ff.; methods of, for two
characters, 256 ff.; partial, 283 ff.; multiple, 307 ff.

Correlation coefficient, product-moment, 143 ff.; computation
of, 146 ff.; interpretation of, 163 ff.; probable error of, 238

Correlation ratio, 177 ff.; probable error of, 239; for
qualitative and unordered series, 266 f.

Courses  in  experimental  and  statistical 
method,  2 

Crude  mode,  90  f. 

Cumulative  frequency  curve,  129  ff. 

Cumulative  frequency  distribution,  for 
Otis  Test,  28 

Curve-fitting,  elements  of,  317  ff. 

Curves, 44; normal probability, 44 f., 204 ff.; types of, 318 f.;
fitting normal, by method of moments, 214 ff., 342 ff.;
criteria and constants for normal, 343

Data, in statistical investigation, 3 f.; collection and
analysis of, 6 f.; collection and classification of, 9 ff.;
primary and secondary, 9 f.; methods of collecting, 14 ff.;
arrangement of, 19 ff.; range of, 22; tabular and graphical
presentation of, 31 ff.; calculation of mean for ungrouped, 79;
fitting normal curve to frequency distribution of, 214 ff.;
representing, on a normal scale, 221 ff.

Data  card,  20 

Determinants, solution by, 312 ff.

Deviates  of  normal  curve,  209  f. 

Deviation,  mean,  102  ff. ;  standard, 
108  ff.;   quartile,  110  ff. 

Diagrams,  presentation  of  results  in, 
7  f. ;  purpose  of,  31  f. ;  column  and 
bar,  36  ff. ;   scatter,  141  f. 

Discontinuous  series,  14 

Dispersion,  variations  in,  79  ;  measures 
of,  101  ff. ;  comparison  of  measures 
of,  113  ff. 

Distribution, probable errors of certain constants for a
normal, 237 ff. See also Binomial distribution, Cumulative
frequency distribution, Simple frequency distribution

Efficiency,  teaching,  15 

Enumeration  in  problems,  14 

Errors, absolute and relative, 65 f.; biased and unbiased,
66 ff.; in educational measurement, 74 ff.; response, 75; of
estimate, 161 f.; sampling, in the mean, 232 ff.; probable, of
the difference between two means, 235 ff.

Estimate,  standard  error  of,  159,  161  f. 

Estimation,  in  experimental  work,  15; 
of  teaching  efficiency,  15 

Exponents,  laws  of,  51  f. 

Free-hand  method  in  curve-fitting,  320, 
323  ff. 

Frequencies,  probable  errors  of  observed 
and  percentage,  243  ff. 

Frequency distribution. See Cumulative frequency distribution,
Simple frequency distribution

Frequency  polygon,  37  f. 

Frequency table, computation of correlation coefficient for,
149 ff.

Function,  hyperbola,  318;  logarithmic 
growth,  318;   nth-order  parabola,  319 

Functional relationships, 42 f.

Geometrical mean, 49; and geometrical series, 91 ff.
Geometrical progression, 48, 91 ff.


Grouping,  correction  for  broad,  263  ff. ; 
correction  for  fineness  of,  269  f. 

Harmonic  mean,  95  f. 
Heteroscedasticity,  186 
Histogram,  37 
Hollerith  Machine,  20  f. 

Imagination,  constructive,  as  requisite,  3 
Indexing,  numerical  or  verbal  mode  of,  13 
Intelligence and attitude, Tulchin's data on, 268 f.
Intercorrelations,  288 
Interpolation,  56  ff. 
Interpretation  of  results,  7 

Kelley's formula for adjusting reliability coefficients, 254
Kurtosis,  variations  in,  79 

Law of Statistical Regularity for Large Numbers, 16 ff.
Least  squares,  method  of,  321  ff.,  329  ff. 
Line,  graph  of  straight,  43  f. 
Linearity,  tests  for,  183  f.,  267 
Logarithms, 47 ff.; invention of, 49 f.; laws of, 52; Briggs
system of, 53; four-place table of, 60 f.; use with rounded
numbers, 71 ff.

McCall's method of scaling, 226 f.

Median,  27,  85  ff. ;  probable  error  of,  238 

Mode,  90  f. 

Moments,  338  ff. 

Multiple-response  scoring  formula,  171 

Nomography,  31 

Normal probability curve. See Probability curve

Ogive  curve,  132 

Ordinates  of  normal  curve,  209  f. 

Otis Test Scores, frequency distribution of, 25; classifier
for, 26 f.; cumulative frequency distribution for, 28;
histogram of, 38; illustrating the median for, 86

Partial correlations, 283 ff.; of first-order, 287; of
second-order, 287

Pearson's correction formula, for broad grouping, 263 ff.; for
Spearman's rank coefficient, 279 f.

Pearson's formula for product-moment correlation, 259 f.

Pearson's Tables, probable error of V with, 238

Percentage  frequency,  probable  error 
of,  243 

Percentile  curves,  131  ff. 

Percentile  method,  127  ff. 

Percentile  ranks,  136  ff. 

Percentiles, definition of, 127; computation of, 128 ff.

Permutations,  191 

Planning  of  calculations,  7 

Point binomial, 196; mean and standard deviation of, 197 f.;
comparison of, and normal curve, 212 ff.

Predictive  value  of  a  test,  168 

Primary  records,  tabulation  for,  6 

Probability,  elementary,  192  ff. 

Probability  curve,  normal,  44  f.,  204  ff. ; 
equation  of,  207  f. ;  area,  ordinates, 
and  deviates  of,  209  ff. 

Probable error, 211; of the difference between two means,
235 ff.; of certain constants for normal distribution,
237 ff.; applications of formulas of, 240 ff.; of observed and
percentage frequencies, 243 ff.; of an observed proportion,
248 ff.; of biserial r, 273; of contingency coefficient, 278;
of correlation from ranks, 280; of β₁ and β₂, 344

Problem,  planning  study  of,  5 

Product-moment correlation coefficient, 143 ff., 258 ff.

Professional  schools  and  statistical 
method,  2 

Progressions, arithmetical and geometrical, 47 f.

Proportion,  probable  error  of,  248  ff. ; 
standard  error  of,  248 

Quadrants, 40 f.

Qualitative  series,  14,  256  ff.,  266  ff. 

Quantitative  series,  14,  256  ff. 

Quartile deviation, 110 ff.

Quartiles, measure of skewness based on, 122
Questionnaires,  15 

Ranks,  correlation  from,  278  ff. 
Records,  tabulation  for  primary,  6 
Rectification,  323  ff. 


Regression,  lines  of,  154  ff.;  meaning 
of,  163  ;  probable  errors  of  coefficients 
of,  239 ;  probable  error  of  higher- 
order  coefficient  of,  239 ;  equation 
for,  292  ff. ;  coefficient  of,  of  third- 
order,  315 

Reliability, coefficient of, 168 ff.; Spearman-Brown formula
for predicting, 169 f.; Kelley's formula for adjusting
coefficient of, 254

Report,  writing  of,  8 

Residuals,  158,  320 

Response  error,  formulas  for,  250  ff. 

Results, interpretation of, 7; presentation of, 7 f.

Rounded numbers, arithmetical computation with, 69 ff.

Sample,  random,  18  f. 

Sampling,  16  ff. 

Scaling  of  test  questions,  224  ff. 

Scores,  standard,  168  f. 

Selection,  effect  of,  upon  correlation,  172 

Series, types of, 12 ff.; quantitative and qualitative, 14,
256 ff.; classification of, 15; correlation ratio for
qualitative and unordered, 266 ff.

Sheppard's  corrections,  341 

Significant  figures,  68 

Simple frequency distribution, 22 ff.; for Otis Test Scores,
25; calculation of mean from, 79 ff.

Skewness, variations in, 79; measurement of, 122 f.

Sorting  by  mechanical  devices,  20  f. 

Source  material,  secondary,  10  f. 

Spearman's  correction  for  attenuation, 
253 

Spearman's formula based on rank differences, 278

Spearman's  theorem  on  correlation,  168 

Standard  deviation,  108  ff. ;  of  point 
binomial,  198  ;   probable  error  of,  238 

Standard  error  of  proportion,  248 

Standard  scores,  168;  in  terms  of 
"true  "  scores  and  response  error,  251  f. 

Standardized  tests,  use  of,  2 

Stanford-Binet  Tests,  168 

Statistical method, need for, 1 f.; general requirements for,
3 ff.; procedure in dealing with problem, 5 ff.; accuracy
in, 65

Statistician, capacity required for, 4 f.

Tables, presentation of results in, 7 f.; purpose of, 31 f.;
construction of, 32 ff.
Tabulation, for primary records, 6; of records, 14; by
mechanical devices, 20 f.
Tallying,  22 
Terman Group Intelligence Tests, 120
Test units, lack of equivalence of, 74
Tests, uses of correlation in evaluating, 167 ff.; validity of,
168; scaling of questions in, 224 ff.
Transmutation formula for comparable scores, 121


"True"  scores,  standard  scores  in  terms 
of,  251  f. 

Validity,  168 

Variability,  measures  of,  101 ;  absolute 
and  relative,  117 

Variables,  independent  and  dependent, 
42 ;  method  of  eliminating  effect  of, 
184  ff. ;  partial  correlation  for  three, 
284 ;  partial  regression  equations  for 
four,  300  ff. 

Variation, coefficient of, 116 ff.

Yule, G. U., 15, 275 f.